key: cord-020872-frr8xba6 authors: santosh, tokala yaswanth sri sai; sanyal, debarshi kumar; bhowmick, plaban kumar; das, partha pratim title: dake: document-level attention for keyphrase extraction date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_49 sha: doc_id: 20872 cord_uid: frr8xba6 keyphrases provide a concise representation of the topical content of a document and they are helpful in various downstream tasks. previous approaches for keyphrase extraction model it as a sequence labelling task and use local contextual information to understand the semantics of the input text but they fail when the local context is ambiguous or unclear. we present a new framework to improve keyphrase extraction by utilizing additional supporting contextual information. we retrieve this additional information from other sentences within the same document. to this end, we propose document-level attention for keyphrase extraction (dake), which comprises bidirectional long short-term memory networks that capture hidden semantics in text, a document-level attention mechanism to incorporate document level contextual information, gating mechanisms which help to determine the influence of additional contextual information on the fusion with local contextual information, and conditional random fields which capture output label dependencies. our experimental results on a dataset of research papers show that the proposed model outperforms previous state-of-the-art approaches for keyphrase extraction. keyphrase extraction is the task of automatically extracting words or phrases from a text, which concisely represent the essence of the text. because of the succinct expression, keyphrases are widely used in many tasks like document retrieval [13, 25] , document categorization [9, 12] , opinion mining [3] and summarization [24, 31] . figure 1 shows an example of a title and the abstract of a research paper along with the author-specified keyphrases highlighted in bold. present methods for keyphrase extraction follow a two-step procedure where they select important phrases from the document as potential keyphrase candidates by heuristic rules [18, 28, 29] and then the extracted candidate phrases are ranked either by unsupervised approaches [17, 21, 27] or supervised approaches [18, 22, 29] . unsupervised approaches score those candidate phrases based on individual words comprising the candidate phrases. they utilize various scoring measures based on the informativeness of the word with respect to the whole document [10] . other paradigms utilize graph-based ranking algorithms wherein each word in the document is mapped to a node in the graph and the connecting edges in the graph represent the association patterns among the words in the document. then, the scores of the individual words are estimated using various graph centrality measures [6, 21, 27] . on the other hand, supervised approaches [4, 14] use binary classification to label the extracted candidate phrases as keyphrases or non-keyphrases, based on various features such as, tf-idf, part-of-speech (pos) tags, and the position of phrases in the document. the major limitation of these supervised approaches is that they classify the labels of each candidate phrase independently without taking into account the dependencies that could potentially exist between neighbouring labels and they also ignore the semantic meaning of the text. 
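a minimal sketch of the traditional two-step extract-then-rank pipeline discussed above: stopword-free n-grams are taken as candidate phrases and then scored by a summed, smoothed tf-idf informativeness of their words. the stopword list, the regular expression and the smoothing are illustrative choices, not the heuristics of the cited systems.

```python
import math, re
from collections import Counter

STOP = {"the", "of", "and", "a", "an", "in", "for", "to", "is", "are", "on", "with", "from"}

def candidates(text, max_len=3):
    """Candidate selection heuristic: contiguous n-grams that contain no stopwords."""
    words = re.findall(r"[a-z][a-z-]*", text.lower())
    cands = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if not any(w in STOP for w in gram):
                cands.add(" ".join(gram))
    return cands

def rank_candidates(doc, corpus, k=5):
    """Unsupervised ranking: score a phrase by the summed (smoothed) tf-idf of its words."""
    tokenized = [re.findall(r"[a-z][a-z-]*", d.lower()) for d in corpus]
    df = Counter(w for toks in tokenized for w in set(toks))        # document frequency
    tf = Counter(re.findall(r"[a-z][a-z-]*", doc.lower()))          # term frequency in doc
    idf = lambda w: math.log((1 + len(corpus)) / (1 + df[w])) + 1
    score = lambda phrase: sum(tf[w] * idf(w) for w in phrase.split())
    return sorted(candidates(doc), key=score, reverse=True)[:k]

docs = ["keyphrase extraction from scientific articles with conditional random fields",
        "opinion mining and summarization of product reviews",
        "document retrieval with keyphrases"]
print(rank_candidates(docs[0], docs))
```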
to overcome the above-stated limitation of these approaches, [8] formulated keyphrase extraction as a sequence labeling task and used linear-chain conditional random fields for this task. however, this approach does not explicitly take into account the long-term dependencies and semantics of the text. more recently, to capture both the semantics of the text as well as the dependencies among the labels of neighboring words, [1] used a deep learning-based approach called bilstm-crf, which combines a bi-directional long short-term memory (bilstm) layer that models the sequential input text with a conditional random field (crf) layer that captures the dependencies in the output. the example of fig. 1 reads as follows. title: dce-mri data analysis for cancer area classification. abstract: the paper aims at improving the support of medical researchers in the context of in-vivo cancer imaging. [..] the proposed approach is based on a three-step procedure: i) robust feature extraction from raw time-intensity curves, ii) voxel segmentation, and iii) voxel classification based on a learning-by-example approach. finally, in the third step, a support vector machine (svm) is trained to classify voxels according to the labels obtained by the clustering phase. the above-mentioned approaches treat keyphrase extraction as a sentence-level task where sentences in the same document are viewed as independent. when labeling a word, local contextual information from the surrounding words is crucial because the context gives insight into the semantic meaning of the word. however, there are many instances in which the local context is ambiguous or lacks sufficient information. if the model has access to supporting information that provides additional context, it may use this additional supporting information to predict the label correctly. such additional supporting information may be found in other sentences of the same document from which the query sentence is taken. to utilize this additional supporting information, we propose a document-level attention mechanism inspired by [20, 30]; it dynamically weights the additional supporting information, emphasizing the most relevant information from each supporting sentence with respect to the local context. but leveraging this additional supporting information has the downside of introducing noise into the representations. to alleviate this problem, we use a gating mechanism [20, 30] that balances the influence of the local contextual representations and the additional supporting information from the document-level contextual representations. to this end, in this paper, we propose document-level attention for keyphrase extraction (dake). it initially produces representations for each word that encode the local context of the query sentence using a bilstm, then uses a document-level attention mechanism to incorporate the most relevant information from each supporting sentence with respect to the local context, and employs a gating mechanism to filter out the irrelevant information. finally, it uses a crf layer, which captures output label dependencies, to decode the gated local and document-level contextual representations and predict the label. the main contributions of this paper are as follows: -we propose dake, a bilstm-crf model augmented with document-level attention and a gating mechanism for improved keyword extraction from research papers. -experimental results on a dataset of research papers show that dake outperforms previous state-of-the-art approaches.
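to make the sequence-labelling view used throughout the paper concrete, the small helper below turns author keyphrases into the per-token kp / not-kp tag sequence that a bilstm-crf tagger would be trained on; the exact-match alignment and lower-casing are simplifying assumptions, not the preprocessing of the cited works.

```python
def tag_sequence(tokens, keyphrases):
    """Mark every token covered by a keyphrase occurrence as 'kp', the rest as 'not-kp'."""
    labels = ["not-kp"] * len(tokens)
    for phrase in keyphrases:
        p = phrase.lower().split()
        for i in range(len(tokens) - len(p) + 1):
            if [t.lower() for t in tokens[i:i + len(p)]] == p:
                for j in range(i, i + len(p)):
                    labels[j] = "kp"
    return labels

tokens = "we study keyphrase extraction with conditional random fields".split()
print(list(zip(tokens, tag_sequence(tokens, ["keyphrase extraction", "conditional random fields"]))))
```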
we formally describe the keyphrase extraction task as follows: given a sentence $s = \{w_1, w_2, \ldots, w_n\}$, where n is the length of the sentence, predict the label sequence $y = \{y_1, y_2, \ldots, y_n\}$, where y_i is the label corresponding to word w_i and can be kp (keyphrase word) or not-kp (not a keyphrase word). every longest sequence of kp words in a sentence is a keyphrase. the main components in our proposed architecture, dake, are: a word embedding layer, a sentence encoding layer, a document-level attention mechanism, a gating mechanism, a context augmenting layer and a label sequence prediction layer. the first layer produces word embeddings of the sentence, from which the second layer generates word representations that encode the local context of the query sentence. then the document-level attention mechanism extracts supporting information from other sentences in the document to enrich the current word representation. subsequently, we utilize a gating mechanism to filter out the irrelevant information from each word representation. the next layer fuses the local and the global contexts into each word representation. finally, we feed these word representations into the crf layer, which acts as a decoder, to predict the label, kp or not-kp, associated with each word. the model is trained in an end-to-end fashion. given a sentence $s_i = \{w_{i1}, w_{i2}, \ldots, w_{in}\}$, which is a sequence of n words, we transform each word w_ij in the sentence s_i into a vector x_ij using pre-trained word embeddings. we use a bilstm [11] to obtain the hidden representation h_i of the sentence s_i. a bilstm comprises a forward-lstm, which reads the input sequence in the original direction, and a backward-lstm, which reads it in the opposite direction. we apply the forward-lstm and the backward-lstm on the sentence s_i and concatenate their outputs to obtain the local contextual representation $h_{ij} = [\overrightarrow{h}_{ij} : \overleftarrow{h}_{ij}]$ for each word w_ij; here, : denotes the concatenation operation. however, many keyphrase mentions are tagged incorrectly by current approaches, including the bilstm-crf model [1], due to ambiguous contexts in the input sentence. in cases where a sentence is short or highly ambiguous, the model may either fail to identify keyphrases due to insufficient information or make wrong predictions by using noisy context. we hypothesize that this limitation can be alleviated using additional supporting information from other sentences within the same document. to extract this global context, we need vector representations of the other sentences in the same document d. we utilize bert [5] as a sentence encoder to obtain representations for the sentences in d. given an input sentence s_l in d, we extract the final hidden state of the [cls] token as the representation h_l of the sentence, where [cls] is the special classification embedding in bert. then, for each word w_ij in the input sentence s_i, we apply an attention mechanism to weight the supporting sentences in d, yielding attention weights $\alpha^{l}_{ij}$ over the sentences, where w_1, w_2 are trainable weight matrices and b_1 is a trainable bias vector. we compute the final representation of the supporting information as $\bar{h}_{ij} = \sum_{l=1}^{m} \alpha^{l}_{ij} h_l$, where m is the number of supporting sentences in d. for each word w_ij, $\bar{h}_{ij}$ captures the document-level supporting evidence with regard to w_ij. though the above supporting information from the entire document is valuable to the prediction, we must mitigate the influence of the distant supporting information, as the prediction should be made primarily on the basis of the local context.
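the sketch below shows how the document-level attention described above and the gating mechanism introduced next could be wired together in pytorch. the class name, the additive attention form (w1, w2, b1 followed by a softmax over sentences) and the single sigmoid gate standing in for the paper's w3..w8 and b2..b4 are assumptions made for illustration, since the exact equations are not reproduced in this text.

```python
import torch, torch.nn as nn

class DocAttentionWithGate(nn.Module):
    """Illustrative document-level attention over supporting sentences plus a gate on the result."""
    def __init__(self, word_dim=600, sent_dim=768, att_dim=256):
        super().__init__()
        self.w1 = nn.Linear(word_dim, att_dim, bias=False)
        self.w2 = nn.Linear(sent_dim, att_dim, bias=True)        # its bias plays the role of b1
        self.v = nn.Linear(att_dim, 1, bias=False)
        self.gate = nn.Linear(word_dim + sent_dim, sent_dim)     # simplified single gate

    def forward(self, h_ij, sent_reprs):
        # h_ij: (word_dim,) local BiLSTM state; sent_reprs: (m, sent_dim) BERT [CLS] vectors
        scores = self.v(torch.tanh(self.w1(h_ij) + self.w2(sent_reprs))).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)                    # weights over the m sentences
        h_bar = alpha @ sent_reprs                               # weighted supporting evidence
        g = torch.sigmoid(self.gate(torch.cat([h_ij, h_bar], dim=-1)))
        d_ij = g * h_bar                                         # gated supporting evidence
        return torch.cat([h_ij, d_ij], dim=-1)                   # a_ij = [h_ij : d_ij]

layer = DocAttentionWithGate()
a_ij = layer(torch.randn(600), torch.randn(12, 768))             # 12 supporting sentences
print(a_ij.shape)                                                # torch.Size([1368])
```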
therefore, we apply a gating mechanism to constrain this influence and enable the model to decide how much of the supporting information should be incorporated; the gate is parameterized by trainable weight matrices w_3, w_4, w_5, w_6, w_7, w_8 and trainable bias vectors b_2, b_3, b_4, and combines representations through the hadamard product $\odot$. d_ij is the representation of the gated supporting evidence for w_ij. for each word w_ij of sentence s_i, we concatenate its local contextual representation h_ij and its gated document-level supporting contextual representation d_ij to obtain its final representation $a_{ij} = [h_{ij} : d_{ij}]$, where : denotes the concatenation operation. these final representations $a_i = \{a_{i1}, a_{i2}, \ldots, a_{in}\}$ of sentence s_i are fed to another bilstm to further encode the local contextual features along with the supporting contextual information into unified representations c_i. the output of this encoding captures the interaction among the context words conditioned on the supporting information. this is different from the initial encoding layer, which captures the interaction among the words of the sentence independently of the supporting information. the obtained contextual representations c_i of the query sentence s_i are given as the input sequence to a crf layer [16], which produces a probability distribution over the output label sequence using the dependencies among the labels of the entire input sequence. to efficiently find the best sequence of labels for an input sentence, the viterbi algorithm [7] is used. we use the dataset from [19], which comprises metadata of papers from several online digital libraries. the dataset contains metadata for 567,830 papers with a clear split into train, validation, and test sets provided by the authors, as follows: 527,830 papers were used for model training, 20,000 were used for validation and the remaining 20,000 were used for testing. we refer to these sets as kp527k, kp20k-v and kp20k respectively. the metadata of each paper consists of the title, abstract, and author-assigned keyphrases. the title and abstract of each paper are used to extract keyphrases, whereas the author-input keyphrases are used as the gold standard for evaluation. we compare our approach, dake, with the following baselines: bi-lstm-crf [1], crf [8], bi-lstm [1], copy-rnn [19], kea [29], tf-idf, textrank [21] and singlerank [27]. we also carry out an ablation test to understand the effectiveness of the document-level attention and gating mechanism components by removing them. similar to previous works, we evaluate the predictions of each method against the author-specified keyphrases that can be located in the corresponding paper abstracts in the dataset ("gold standard"). we present results for all our experiments using the precision, recall, and f1-score measures. for comparison of the methods, we choose the f1-score, which is the harmonic mean of precision and recall. we use pre-trained word embedding vectors obtained using glove [23], and scibert [2], a bert model trained on scientific text, as the sentence encoder. for word representations we use 300-dimensional pre-trained embeddings, and for the sentence encoder we use 768-dimensional representations obtained from scibert. the hidden state of the lstm is set to 300 dimensions. the model is trained end-to-end using the adam optimization method [15]. the learning rate is initially set to 0.001 and decayed by 0.5 after each epoch. for regularization, to avoid over-fitting, dropout [26] is applied to each layer.
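for the evaluation protocol just described, the snippet below shows one common way of scoring predicted keyphrases against the author-assigned gold standard with exact-match precision, recall and f1-score; the lower-casing normalization is illustrative, and the paper's evaluation may additionally apply stemming or other normalization.

```python
def prf1(predicted, gold):
    """Exact-match precision / recall / F1 between predicted and gold keyphrase sets."""
    pred = {p.lower().strip() for p in predicted}
    ref = {g.lower().strip() for g in gold}
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1(["support vector machine", "voxel segmentation", "deep learning"],
           ["support vector machine", "voxel segmentation", "cancer imaging"]))
# (0.666..., 0.666..., 0.666...)
```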
we select the model with the best f1-score on the validation set, kp20k-v. table 1a shows the results of our approach in comparison to various baselines. our approach, dake outperforms all baselines in terms of the f1-score. tf-idf, textrank and singlerank are unsupervised extractive approaches while kea, bi-lstm-crf, crf, bi-lstm follow supervised extractive approach. copyrnn is a recently proposed generative model based on sequence-to-sequence learning along with a copying mechanism. for the unsupervised models and the sequence-to-sequence learning model, we report the performance at top-5 predicted keyphrases since top-5 showed highest performance in the previous works for these models. from table 1a , we observe that the deep learning-based approaches perform better than the traditional feature-based approaches. this indicates the importance of understanding the semantics of the text for keyphrase extraction. bilstm-crf yields better results in terms of the f1-score over crf (improvement of f1-score by 18.17% from 17.46% to 35.63%) and bilstm (improvement of f1-score by 18.88% from 16.75% to 35.63%) models alone. this result indicates that the combination of bilstm, which is powerful in capturing the semantics of the textual content, with crf, which captures the dependencies among the output labels, helped boost the performance in identifying keyphrases. our proposed method, dake outperforms the bilstm-crf (improvement of f1-score by 6.67% from 35.63% to 42.30%) approach, which indicates that the incorporation of additional contextual information from other sentences in the document into the bilstm-crf model helps to further boost the performance. table 1b shows the ablation study. we observe that document-level attention increases the f1-score of the baseline bilstm-crf by 0.84% (from 35.63% to 36.47%). this validates our hypothesis that additional supporting information boosts the performance for keyphrase extraction. but leveraging this additional supporting information has a downside of introducing noise into the representations, and to alleviate this, we used a gating mechanism which boosted the f1-score by 1.62% (from 36.47% to 38.09%). document-level attention did not show great improvement when it has only one layer of bisltm because the final tagging predictions mainly depend on the local context of each word while additional context only supplements extra information. therefore, our model needs another layer of bilstm to encode the sequential intermediate vectors containing additional context and local context, as evidenced from our f1-score improvement by 4.21% (from 38.09% to 42.30%). when crf is removed from dake, the f1-score falls by 3.09%, showing that crf successfully captures the output label dependencies. we proposed an architecture, dake, for keyword extraction from documents. it uses a bilstm-crf network enhanced with a document-level attention mechanism to incorporate contextual information from the entire document, and gating mechanisms to balance between the global and the local contexts. it outperforms existing keyphrase extraction methods on a dataset of research papers. in future, we would like to integrate the relationships between documents such as those available from a citation network by enhancing our approach with contexts in which the document is referenced within a citation network. 
bi-lstm-crf sequence labeling for keyphrase extraction from scholarly documents scibert: pretrained contextualized embeddings for scientific text opinion expression mining by exploiting keyphrase extraction citation-enhanced keyphrase extraction from research papers: a supervised approach bert: pre-training of deep bidirectional transformers for language understanding positionrank: an unsupervised approach to keyphrase extraction from scholarly documents the viterbi algorithm incorporating expert knowledge into keyphrase extraction corephrase: keyphrase extraction for document clustering conundrums in unsupervised keyphrase extraction: making sense of the state-of-the-art long short-term memory a study on automatically extracted keywords in text categorization phrasier: a system for interactive document retrieval using keyphrases automatic keyphrase extraction from scientific articles adam: a method for stochastic optimization conditional random fields: probabilistic models for segmenting and labeling sequence data unsupervised keyphrase extraction: introducing new kinds of words to keyphrases human-competitive tagging using automatic keyphrase extraction deep keyphrase generation document-level neural machine translation with hierarchical attention networks textrank: bringing order into text keyphrase extraction in scientific publications glove: global vectors for word representation citation summarization through keyphrase extraction enhancing access to scholarly publications with surrogate resources dropout: a simple way to prevent neural networks from overfitting single document keyphrase extraction using neighborhood knowledge ptr: phrase-based topical ranking for automatic keyphrase extraction in scientific publications kea: practical automated keyphrase extraction global attention for name tagging world wide web site summarization. web intell. agent syst project sponsored by ministry of human resource development, government of india at iit kharagpur. key: cord-020830-97xmu329 authors: ghanem, bilal; karoui, jihen; benamara, farah; rosso, paolo; moriceau, véronique title: irony detection in a multilingual context date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_18 sha: doc_id: 20830 cord_uid: 97xmu329 this paper proposes the first multilingual (french, english and arabic) and multicultural (indo-european languages vs. less culturally close languages) irony detection system. we employ both feature-based models and neural architectures using monolingual word representation. we compare the performance of these systems with state-of-the-art systems to identify their capabilities. we show that these monolingual models trained separately on different languages using multilingual word representation or text-based features can open the door to irony detection in languages that lack of annotated data for irony. figurative language makes use of figures of speech to convey non-literal meaning [2, 16] . it encompasses a variety of phenomena, including metaphor, humor, and irony. we focus here on irony and uses it as an umbrella term that covers satire, parody and sarcasm. irony detection (id) has gained relevance recently, due to its importance to extract information from texts. for example, to go beyond the literal matches of user queries, veale enriched information retrieval with new operators to enable the non-literal retrieval of creative expressions [40] . 
also, the performances of sentiment analysis systems drastically decrease when applied to ironic texts [5, 19] . most related work concern english [17, 21] with some efforts in french [23] , portuguese [7] , italian [14] , dutch [26] , hindi [37] , spanish variants [31] and arabic [11, 22] . bilingual id with one model per language has also been explored, like english-czech [32] and english-chinese [38] , but not within a cross-lingual perspective. in social media, such as twitter, specific hashtags (#irony, #sarcasm) are often used as gold labels to detect irony in a supervised learning setting. although recent studies pointed out the issue of false-alarm hashtags in selflabeled data [20] , id via hashtag filtering provides researchers positive examples with high precision. on the other hand, systems are not able to detect irony in languages where such filtering is not always possible. multilingual prediction (either relying on machine translation or multilingual embedding methods) is a common solution to tackle under-resourced languages [6, 33] . while multilinguality has been widely investigated in information retrieval [27, 34] and several nlp tasks (e.g., sentiment analysis [3, 4] and named entity recognition [30] ), no one explored it for irony. we aim here to bridge the gap by tackling id in tweets from both multilingual (french, english and arabic) and multicultural perspectives (indo-european languages whose speakers share quite the same cultural background vs. less culturally close languages). our approach does not rely either on machine translation or parallel corpora (which are not always available), but rather builds on previous corpus-based studies that show that irony is a universal phenomenon and many languages share similar irony devices. for example, karoui et al. [24] concluded that their multi-layer annotated schema, initially used to annotate french tweets, is portable to english and italian, observing relatively the same tendencies in terms of irony categories and markers. similarly, chakhachiro [8] studies irony in english and arabic, and shows that both languages share several similarities in the rhetorical (e.g., overstatement), grammatical (e.g., redundancy) and lexical (e.g., synonymy) usage of irony devices. the next step now is to show to what extent these observations are still valid from a computational point of view. our contributions are: i. a new freely available corpus of arabic tweets manually annotated for irony detection 1 . ii. monolingual id: we propose both feature-based models (relying on language-dependent and language-independent features) and neural models to measure to what extent id is language dependent. iii. cross-lingual id: we experiment using cross-lingual word representation by training on one language and testing on another one to measure how the proposed models are culture-dependent. our results are encouraging and open the door to id in languages that lack of annotated data for irony. arabic dataset (ar = 11,225 tweets). our starting point was the corpus built by [22] that we extended to different political issues and events related to the middle east and maghreb that hold during the years 2011 to 2018. tweets were collected using a set of predefined keywords (which targeted specific political figures or events) and containing or not arabic ironic hashtags 2 . the collection process resulted in a set of 6,809 ironic tweets (i) vs. 
15,509 non ironic (ni) written using standard (formal) and different arabic language varieties: egypt, gulf, levantine, and maghrebi dialects. to investigate the validity of using the original tweets labels, a sample of 3,000 i and 3,000 ni was manually annotated by two arabic native speakers which resulted in 2,636 i vs. 2,876 ni. the inter-annotator agreement using cohen's kappa was 0.76, while the agreement score between the annotators' labels and the original labels was 0.6. agreements being relatively good knowing the difficulty of the task, we sampled 5,713 instances from the original unlabeled dataset to our manually labeled part. the added tweets have been manually checked to remove duplicates, very short tweets and tweets that depend on external links, images or videos to understand their meaning. french dataset (fr = 7,307 tweets). we rely on the corpus used for the deft 2017 french shared task on irony [5] which consists of tweets relative to a set of topics discussed in the media between 2014 and 2016 and contains topic keywords and/or french irony hashtags (#ironie, #sarcasme). tweets have been annotated by three annotators (after removing the original labels) with a reported cohen's kappa of 0.69. english dataset (en = 11,225 tweets). we use the corpus built by [32] which consists of 100,000 tweets collected using the hashtag #sarcasm. it was used as benchmark in several works [13, 18] . we sliced a subset of approximately 11,200 tweets to match the sizes of the other languages' datasets. table 1 shows the tweet distribution in all corpora. across the three languages, we keep a similar number of instances for train and test sets to have fair cross-lingual experiments as well (see sect. 4). also, for french, we use the original dataset without any modification, keeping the same number of records for train and test to better compare with state-of-the-art results. for the classes distribution (ironic vs. non ironic), we do not choose a specific ratio but we use the resulted distribution from the random shuffling process. it is important to note that our aim is not to outperform state-of-the-art models in monolingual id but to investigate which of the monolingual architectures (neural or feature-based) can achieve comparable results with existing systems. the result can show which kind of features works better in the monolingual settings and can be employed to detect irony in a multilingual setting. in addition, it can show us to what extend id is language dependent by comparing their results to multilingual results. two models have been built, as explained below. prior to learning, basic preprocessing steps were performed for each language (e.g., removing foreign characters, ironic hashtags, mentions, and urls). feature-based models. we used state-of-the-art features that have shown to be useful in id: some of them are language-independent (e.g., punctuation marks, positive and negative emoticons, quotations, personal pronouns, tweet's length, named entities) while others are language-dependent relying on dedicated lexicons (e.g., negation, opinion lexicons, opposition words). several classical machine learning classifiers were tested with several feature combinations, among them random forest (rf) achieved the best result with all features. neural model with monolingual embeddings. we used convolutional neural network (cnn) network whose structure is similar to the one proposed by [25] . 
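a compact pytorch sketch of a sentence cnn in the spirit of [25]: parallel convolutions over the embedded tweet, max-pooling over time and a binary output. the filter widths, filter count and vocabulary size are placeholder values rather than the tuned hyperparameters, and the embedding layer would be initialized with the pretrained vectors discussed next.

```python
import torch, torch.nn as nn

class IronyCNN(nn.Module):
    """Kim-style sentence CNN for ironic vs. non-ironic classification (illustrative sizes)."""
    def __init__(self, vocab_size, emb_dim=300, n_filters=100, widths=(3, 4, 5)):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)          # load pretrained vectors here
        self.convs = nn.ModuleList([nn.Conv1d(emb_dim, n_filters, w) for w in widths])
        self.out = nn.Linear(n_filters * len(widths), 2)

    def forward(self, token_ids):                             # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)               # (batch, emb_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.out(torch.cat(pooled, dim=1))             # logits

model = IronyCNN(vocab_size=5000)
print(model(torch.randint(0, 5000, (8, 40))).shape)           # torch.Size([8, 2])
```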
for the embeddings, we relied on arav ec [36] for arabic, fasttext [15] for french, and word2vec google news [29] for english 3 . for the three languages, the size of the embeddings is 300 and the embeddings were fine-tuned during the training process. the cnn network was tuned with 20% of the training corpus using the hyperopt 4 library. table 2 shows the results obtained when using train-test configurations for each language. for english, our results, in terms of macro f-score (f ), were not comparable to those of [32, 39] , as we used 11% of the original dataset. for french, our scores are in line with those reported in state of the art (cf. best system in the irony shared task achieved f = 78.3 [5] ). they outperform those obtained for arabic (a = 71.7) [22] and are comparable to those recently reported in the irony detection shared task in arabic tweets [11, 12] (f = 84.4). overall, the results show that semantic-based information captured by the embedding space are more productive comparing to standard surface and lexicon-based features. we use the previous cnn architecture with bilingual embedding and the rf model with surface features (e.g., use of personal pronoun, presence of interjections, emoticon or specific punctuation) 5 to verify which pair of the three languages: (a) has similar ironic pragmatic devices, and (b) uses similar textbased pattern in the narrative of the ironic tweets. as continuous word embedding spaces exhibit similar structures across (even distant) languages [28] , we use a multilingual word representation which aims to learn a linear mapping from a source to a target embedding space. many methods have been proposed to learn this mapping such as parallel data supervision and bilingual dictionaries [28] or unsupervised methods relying on monolingual corpora [1, 10, 41] . for our experiments, we use conneau et al.'s approach as it showed superior results with respect to the literature [10] . we perform several experiments by training on one language (lang 1 ) and testing on another one (lang 2 ) (henceforth lang 1 → lang 2 ). we get 6 configurations, plus two others to evaluate how irony devices are expressed cross-culturally, i.e. in european vs. non european languages. in each experiment, we took 20% from the training to validate the model before the testing process. table 3 presents the results. from a semantic perspective, despite the language and cultural differences between arabic and french languages, cnn results show a high performance comparing to the other languages pairs when we train on each of these two languages and test on the other one. similarly, for the french and english pair, but when we train on french they are quite lower. we have a similar case when we train on arabic and test on english. we can justify that by, the language presentation of the arabic and french tweets are quite informal and have many dialect words that may not exist in the pretrained embeddings we used comparing to the english ones (lower embeddings coverage ratio), which become harder for the cnn to learn a clear semantic pattern. another point is the presence of arabic dialects, where some dialect words may not exist in the multilingual pretrained embedding model that we used. on the other hand, from the text-based perspective, the results show that the text-based features can help in the case when the semantic aspect shows weak detection; this is the case for the ar −→ en configuration. 
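the cross-lingual word representation described above learns a linear map from a source embedding space to a target one. the snippet below shows the supervised orthogonal-procrustes variant of such a mapping on synthetic vectors; it is a simplification of the unsupervised alignment of [10] that is actually used, and the seed-dictionary setup is assumed for illustration.

```python
import numpy as np

def procrustes_map(X_src, Y_tgt):
    """Orthogonal W minimizing ||X W - Y|| over aligned word pairs (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

X = np.random.randn(1000, 300)                       # source-language word vectors
W_true = np.linalg.qr(np.random.randn(300, 300))[0]  # hidden rotation between the two spaces
Y = X @ W_true                                       # pretend target-language vectors
W = procrustes_map(X, Y)
print(np.allclose(W, W_true))                        # True: the mapping is recovered
```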
it is worthy to mention that the highest result we get in this experiment is from the en → fr pair, as both languages use latin characters. finally, when investigating the relatedness between european vs. non european languages (cf. (en/fr) → ar), we obtain similar results than those obtained in the monolingual experiment (macro f-score 62.4 vs. 68.0) and best results are achieved by ar → (en/fr). this shows that there are pragmatic devices in common between both sides and, in a similar way, similar text-based patterns in the narrative way of the ironic tweets. this paper proposes the first multilingual id in tweets. we show that simple monolingual architectures (either neural or feature-based) trained separately on each language can be successfully used in a multilingual setting providing a crosslingual word representation or basic surface features. our monolingual results are comparable to state of the art for the three languages. the cnn architecture trained on cross-lingual word representation shows that irony has a certain similarity between the languages we targeted despite the cultural differences which confirm that irony is a universal phenomena, as already shown in previous linguistic studies [9, 24, 35] . the manual analysis of the common misclassified tweets across the languages in the multilingual setup, shows that classification errors are due to three main factors. (1) first, the absence of context where writers did not provide sufficient information to capture the ironic sense even in the monolingual setting, as in (let's start again, get off get off mubarak!! ) where the writer mocks the egyptian revolution, as the actual president "sisi" is viewed as mubarak's fellows. (2) second, the presence of out of vocabulary (oov) terms because of the weak coverage of the multilingual embeddings which make the system fails to generalize when the oov set of unseen words is large during the training process. we found tweets in all the three languages written in a very informal way, where some characters of the words were deleted, duplicated or written phonetically (e.g phat instead of fat). (3) another important issue is the difficulty to deal with the arabic language. arabic tweets are often characterized by non-diacritised texts, a large variations of unstandardized dialectal arabic (recall that our dataset has 4 main varieties, namely egypt, gulf, levantine, and maghrebi), presence of transliterated words (e.g. the word table becomes (tabla)), and finally linguistic code switching between modern standard arabic and several dialects, and between arabic and other languages like english and french. we found some tweets contain only words from one of the varieties and most of these words do not exist in the arabic embeddings model. for example in (since many days mubarak didn't die .. is he sick or what? #egypt), only the words (day), (mubarak), and (he) exist in the embeddings. clearly, considering only these three available words, we are not able to understand the context or the ironic meaning of the tweet. to conclude, our multilingual experiments confirmed that the door is open towards multilingual approaches for id. furthermore, our results showed that id can be applied to languages that lack of annotated data. our next step is to experiment with other languages such as hindi and italian. 
unsupervised neural machine translation irony as relevant inappropriateness comparative experiments using supervised learning and machine translation for multilingual sentiment analysis bilingual sentiment embeddings: joint projection of sentiment across languages analyse d'opinion et langage figuratif dans des tweets présentation et résultats du défi fouille de textes deft2017 multilingual natural language processing applications: from theory to practice clues for detecting irony in user-generated contents: oh s "so easy";-) translating irony in political commentary texts from english into arabic irony as indirectness cross-linguistically: on the scope of generic mechanisms word translation without parallel data idat@fire2019: overview of the track on irony detection in arabic tweets idat@fire2019: overview of the track on irony detection in arabic tweets ldr at semeval-2018 task 3: a low dimensional text representation for irony detection annotating irony in a novel italian corpus for sentiment analysis learning word vectors for 157 languages logic and conversation semeval-2018 task 3: irony detection in english tweets sentiment polarity classification of figurative language: exploring the role of irony-aware and multifaceted affect features irony detection in twitter: the role of affective content disambiguating false-alarm hashtag usages in tweets for irony detection irony detection with attentive recurrent neural networks soukhria: towards an irony detection system for arabic in social media towards a contextual pragmatic model to detect irony in tweets exploring the impact of pragmatic phenomena on irony detection in tweets: a multilingual corpus study convolutional neural networks for sentence classification the perfect solution for detecting sarcasm in tweets# not unsupervised cross-lingual information retrieval using monolingual data only efficient estimation of word representations in vector space linguistic regularities in continuous space word representations improving multilingual named entity recognition with wikipedia entity type mapping overview of the task on irony detection in spanish variants sarcasm detection on czech and english twitter a survey of cross-lingual embedding models cross-lingual learning-torank with shared representations a contrastive study of ironic expressions in english and arabic aravec: a set of arabic word embedding models for use in arabic nlp a corpus of english-hindi code-mixed tweets for sarcasm detection chinese irony corpus construction and ironic structure analysis reasoning with sarcasm by reading inbetween creative language retrieval: a robust hybrid of information retrieval and linguistic creativity unsupervised cross-lingual word embedding by multilingual neural language models key: cord-020932-o5scqiyk authors: zhong, wei; rohatgi, shaurya; wu, jian; giles, c. lee; zanibbi, richard title: accelerating substructure similarity search for formula retrieval date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_47 sha: doc_id: 20932 cord_uid: o5scqiyk formula retrieval systems using substructure matching are effective, but suffer from slow retrieval times caused by the complexity of structure matching. we present a specialized inverted index and rank-safe dynamic pruning algorithm for faster substructure retrieval. formulas are indexed from their operator tree (opt) representations. 
our model is evaluated using the ntcir-12 wikipedia formula browsing task and a new formula corpus produced from math stackexchange posts. our approach preserves the effectiveness of structure matching while allowing queries to be executed in real-time. in information retrieval, a great deal of research has gone into creating efficient search engines for large corpora. however, few have addressed substructure search in structural content, e.g., in mathematical information retrieval (mir) [21] where efficient substructure similarity search is needed to identify shared subexpressions effectively. for example, in math formula search, to discern that a + b and b + a are equivalent (by commutativity), but that ab + cd and a + bcd are different, applying tokenization and counting common token frequencies is insufficient. instead, a hierarchical representation of mathematical operations is needed and we may want to identify shared substructures. in the most recent math similarity search competition, 1 effective systems all take a tree-based approach by extracting query terms from tree representations. for example, an operator tree (opt) is used in fig. 1 to represent math formulas where operands are represented by leaves and operators are located at internal nodes. this facilitates searching substructures shared by two math expressions. for example, we can extract paths from their tree representations and find their shared subtrees by matching their common paths grouped by subtree root nodes. however, in order to carry structure information, it is common to see structural queries with over tens or even hundreds of path tokens which is unusual for normal fulltext search. this makes query processing costly for realistic math search tasks. in text similarity search, query processing can be accelerated through dynamic pruning [18] , which typically estimates score upperbounds to prune documents unlikely to be in the top k results. however, effective substructure search requires additional matching or alignment among query terms, and this makes it hard to get a good score estimation and it prevents us applying traditional dynamically pruning effectively. in fact, reportedly few state-of-the-art mir systems have achieved practical query run times even when given a large amount of computing resources [11, 20] . in this paper we try to address this problem by introducing a specialized inverted index and we propose a dynamic pruning method based on this inverted index to boost formula retrieval efficiency. recently there has been an increasing amount of research on similarity search for math formulas, with most focusing on search effectiveness [5, 7, 11, 23] . there are many emerging issues regarding effectiveness, including handling mathematical semantics, and identifying interchangeable symbols and common subexpressions. however, the efficiency of math formula search systems is often not addressed. a number of mir systems apply text search models to math retrieval, extracting sequential features from formulas and use variants of tf-idf scoring [12, 14, 16] . these approaches incorporate a bag-of-words model, and use frequency to measure formula similarity. inevitably, they need to index different combinations of sequences or substrings to handle operator commutativity and subexpression identification. this index augmentation results in a non-linearly increasing index size in the number of indexed "words" [12] and thus hurts efficiency for large corpora. 
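to make the opt path representation of fig. 1 concrete, the sketch below reads tokenized leaf-to-root paths off an operator tree given as nested tuples; the tuple encoding and the var/num tokenization are simplifications of the representation actually used by the system.

```python
def leaf_root_paths(node, ancestors=()):
    """Collect tokenized leaf-to-root paths from an OPT given as (operator, child, ...) tuples."""
    if isinstance(node, str):                       # operand leaf
        token = "var" if node.isalpha() else "num"  # interchangeable symbols are tokenized
        return [[token] + list(ancestors)]
    op, *children = node
    paths = []
    for child in children:
        paths += leaf_root_paths(child, (op,) + ancestors)
    return paths

# a + bc  ->  OPT: add(a, times(b, c))
opt = ("add", "a", ("times", "b", "c"))
for p in leaf_root_paths(opt):
    print("/".join(p))
# var/add, var/times/add, var/times/add
```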
on the other hand, recent results [10, 20, 23] reveal that effective systems for formula retrieval use tree-based approaches distinct from text-based methods. however, tree-based systems usually need to calculate costly graph matching or edit distance metrics [9, 22] , which generally have non-linear time complexity. recently, a path-based approach [23] was developed to search substructures in formula opts approximately by assuming that identical formulas have the same leaf-root path set. although at the time of writing, it obtains the best effectiveness for the ntcir-12 dataset, the typically large number of query paths means that query run times are not ideal -maximum run times can be a couple of seconds. dynamic pruning has been recognized as an effective way to reduce query processing times [2, 8, 13, 18] . dynamic pruning speeds up query processing by skipping scoring calculations or avoiding unnecessary reads for documents which are unlikely to be ranked in the top k results. pruning methods can be based on different query processing schemes: document-at-a-time (daat) requires all relevant posting lists be merged simultaneously. term-at-a-time (taat) or score-at-a-time (saat) processes one posting list at a time for each term, requiring additional memory to store partial scores, and posting lists in this case are usually sorted by document importance (e.g, impact score [1] ), with promising documents placed at the front of inverted lists. pruning strategies are rank-safe (or safe up to rank k ) [19] if they guarantee that the top k documents are ranked in the same order before and after pruning. the most well-known rank-safe pruning strategies for daat are maxscore [8, 17, 19] and wand variants [3, 6] . shan et al. [15] show that maxscore variants (e.g. bmm, lbmm) outperform other dynamic pruning strategies for long queries, and recently mallia et al. [2] report a similar finding over a range of popular index encodings. baseline model. this work is based on our previous work [23] which extracts prefixes from opt leaf-root paths as index or query terms. the opt is parsed from a formula in l a t e x. for indexed paths, they are mapped to corresponding posting lists in an inverted index where the ids of expressions containing the path are appended. for query paths, the corresponding posting lists are merged and approximate matching is performed on candidates one expression at a time. the similarity score is measured from matched common subtree(s). because math symbols are interchangeable, paths are tokenized for better recall, e.g., variables such as a, b, c are tokenized into var. in our tokenized path representation uppercase words denote token types, which may be for operators as well as operands (e.g., times for symbols representing multiplication). in fig. 1 , when indexing "bc + xy + a + z," its expression id (or expid) will be appended to posting lists associated with tokenized prefix paths from its opt representation, i.e., var/times, var/add and var/times/add. at query processing, the shared structures highlighted in black and gray are found by matching these tokenized paths (two paths match if and only if they have the same tokenized paths, for example, "a/+" and "z/+" can be matched) and common subtree roots are identified by grouping paths by their root nodes. as a result, the posting list entry also stores the root node id for indexed paths, in order to reconstruct matches substructures at merge time. 
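building on the path extraction sketched earlier, the following appends one expression to an inverted index keyed by tokenized prefix paths, storing (expid, root node id) entries as described above. the node-id assignment and entry layout are simplified assumptions: the real index additionally stores per-node path counts and compresses the lists.

```python
from collections import defaultdict
from itertools import count

def index_formula(index, expid, opt):
    """Append one expression: every tokenized prefix of a leaf-root path (with at least one
    operator) maps to a posting entry (expid, id of the node the prefix is rooted at)."""
    ids = count()
    def walk(node, ancestors):                      # ancestors: [(token, node_id)], leaf-most first
        nid = next(ids)
        if isinstance(node, str):
            chain = [("var" if node.isalpha() else "num", nid)] + ancestors
            for end in range(2, len(chain) + 1):    # each prefix is rooted at chain[end - 1]
                prefix = "/".join(tok for tok, _ in chain[:end])
                index[prefix].append((expid, chain[end - 1][1]))
        else:
            op, *children = node
            for child in children:
                walk(child, [(op, nid)] + ancestors)
    walk(opt, [])

index = defaultdict(list)
index_formula(index, 12, ("add", ("times", "b", "c"), ("times", "x", "y"), "a", "z"))
print(dict(index))   # posting lists for var/times, var/times/add and var/add
```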
at query time, the similarity score is given by the size of matched common subtrees. specifically, the model chooses a number of "widest" matched subtree(s) (e.g., a + bc is the widest matched in fig. 1 because it has 3 common leaves and is "wider" than the other choices) and measure formula similarity based on the size of these common subtrees. the original approach0 model [23] matches up to three widest common subtrees and scores similarity by a weighted sum of the number of matched leaves (operands) and operators from different common subtreest i q ,t i d of a common forest π. operators and operand (leaf) nodes weights are controlled by parameter α, while the weight of rooted substructures from largest to smallest are given by β i . in the following, | · | indicates the size of a set: interestingly, while multiple subtree matching boosts effectiveness, using just the widest match still outperforms other systems in terms of highly relevant results [23] . the simplified similarity score based on widest common subtree between query and document opts t q , t d is the widest match w * q,d , formally where cfs(t q , t d ) are all the common formula subtrees between t q and t d . in addition to subtree isomorphism, a formula subtree requires leaves in a subtree to match leaves in the counterpart, in other words, subtrees are matched bottomup from operands in opts. in fig. 1 , the value of w * q,d is 3, produced by the widest common subtrees shown in gray. dynamic pruning. in dynamic pruning, the top k scored hits are kept throughout the querying process, with the lowest score in the top k at a given point defining the threshold θ. since at most k candidates will be returned, dynamic pruning strategies work by estimating score upperbounds before knowing the precise score of a hit so that candidate hits with a score upperbound less or equal to θ can be pruned safely, because they will not appear in the final top k results. moreover, if a subset of posting lists alone cannot produce a top k result from their upperbounds, they are called a non-requirement set, the opposite being the requirement set. posting lists in the non-requirement with ids less than the currently evaluating ids in the requirement set can be skipped safely, because posting lists in the non-requirement set alone will not produce a top k candidate. in this paper, we apply dynamic pruning to structural search. as structure search has more query terms in general, we focus on a maxscore-like strategy suggested by [2, 15] , since they do not need to sort query terms at merge iterations (which is expensive for long queries). our approach is different from the original maxscore, as upperbound scores are also calculated from the query tree representation. we also use the simplified scoring eq. (2) where a subset of query terms in the widest matched common subtreest * q ,t * d contribute to the score. in contrast, typical tf-idf scoring has all hit terms contribute to the rank score. when we merge posting lists, a set of query paths match paths from a document expression one at a time, each time a hit path set for matched query and candidate paths are examined. define p(t ) to be all paths extracted from opt t , i.e., p(t ) = {p : p ∈ leafroot paths(t n ), n ∈ t } where t n is the entire subtree of t rooted at n with all its descendants. 
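the generic rank-safe pruning loop described above (a top-k heap, a threshold θ, and a skip whenever the score upperbound cannot beat θ) can be sketched as follows; upper_bound and score are placeholders for the structure-specific estimates developed in the rest of this section.

```python
import heapq

def top_k_with_pruning(candidates, k, upper_bound, score):
    """Rank-safe pruning skeleton: skip a candidate when its upperbound cannot exceed theta."""
    heap = []                                   # (score, candidate id) min-heap of the top k
    theta = float("-inf")
    for cand in candidates:
        if len(heap) == k and upper_bound(cand) <= theta:
            continue                            # pruned: provably outside the top k
        s = score(cand)                         # the expensive exact scoring step
        if len(heap) < k:
            heapq.heappush(heap, (s, cand))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, cand))
        if len(heap) == k:
            theta = heap[0][0]
    return sorted(heap, reverse=True)

hits = top_k_with_pruning(range(1000), k=3,
                          upper_bound=lambda c: (c % 17) + 1,   # cheap, never underestimates
                          score=lambda c: c % 17)               # stand-in for structure matching
print(hits)
```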
we model the hit path set by a bipartite graph g(q, d, e) where q = {q : q ∈ p(t q )}, d = {d : d ∈ p(t d )} are query and document path sets, and edges are ordered pairs e = {(q, d) : tokenized(q) = tokenized(d), q ∈ q, d ∈ d} representing a potential match between a query path to a document path. since an edge is established only for paths with the same token sequence, we can partition the graph into disconnected smaller bipartite graphs g t = g(q t , d t , e t ), each identified by tokenized query path t: figure 2 shows the hit path set of the example in fig. 1 , this example can be partitioned into independent subgraphs associated with tokenized paths var/times/add, var/times and var/add. each partition is actually a complete bipartite graph (fully connected) because for any edge between q t and d t , it is in edge set e t . and for each complete bipartite graph g(q t , d t , e t ), we can obtain their maximum matching sizes from min(|q t |, |d t |) easily. on the other hand, to calculate score w * q,d , we need to find a pair of query and document nodes at which the widest common subtreet * q ,t * d are rooted (see eq. 2), so we also define the matching candidate relations filtered by nodes. let g (m,n) = g(q (m) , d (n) , e (m,n) ) be the subgraph matching between query subtree rooted at m and document subtree rooted at n where then, similarity score w * q,d can be calculated from selecting the best matched node pairs and summing their partition matches. specifically, define token paths of tree t rooted at n as set t(n) = {t : t = tokenized(p), p ∈ leafroot paths(t n )}, where ν(g) is the maximum matching size of bipartite graph g. t |) as our (precomputed) partial score upperbound. it is analogous to text search where each posting list has a partial score upperbound, the tf-idf score upperbound is merely their sum. in our case, the sum for partial score upperbounds is only for one node or a subtree. in the following we propose three strategies to compute w * q,d upperbound from partial score upperbounds and assign non-requirement set. max reference (maxref ) strategy. in maxscore [17, 19] , each posting list has a partial score upperbound, however, our scoring function implies each posting list can be involved with multiple partial score upperbounds. one way to select the non-requirement set in our case is using an upperbound score maxref t (for each posting list t) which is the maximum partial score from the query nodes by which this posting list gets "referenced", and if a set of posting lists alone has a sum of maxref scores less or equal to θ, they can be safely put into the non-requirement set. the rank safety can be justified, since each posting list corresponds to a unique tokenized path t, and maxref t = max m w m,t . then for m ∈ t q , n ∈ t d , greedy binary programming (gbp) strategies. inequality (6) is relaxed twice, so it spurs the motivation to get tighter upperbound value by maximizing the number of posting lists in the non-requirement set, so that more posting lists are likely to be skipped. define partial upperbound matrix w = {w i,j } |tq|×|t| where t = {t(m), m ∈ t q } are all the token paths from query opt (t is essentially the same as tokenized p(t q )), and a binary variable x |t|×1 indicating which corresponding posting lists are placed in the non-requirement set. 
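since each partition is a complete bipartite graph, eq. (5) reduces a node pair (m, n) to a sum of min(w_{m,t}, w_{n,t}) over the tokenized paths the two nodes share, and w*_{q,d} takes the maximum over node pairs. the sketch below computes this from per-node path-count dictionaries; the toy counts loosely echo fig. 1 and are made up for illustration.

```python
def widest_match(query_nodes, doc_nodes):
    """w*_{q,d}: max over (query node m, document node n) of sum_t min(w_{m,t}, w_{n,t}).
    Each node is given as {tokenized_path: leaf count}."""
    best = 0
    for m, qt in query_nodes.items():
        for n, dt in doc_nodes.items():
            width = sum(min(qt[t], dt[t]) for t in qt.keys() & dt.keys())
            best = max(best, width)
    return best

query = {"q1": {"var/times/add": 4, "var/add": 2}, "q2": {"var/times": 2}}
doc   = {"d1": {"var/times/add": 2, "var/add": 1}, "d2": {"var/times": 2}}
print(widest_match(query, doc))   # 3, from node pair q1/d1: min(4, 2) + min(2, 1)
```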
one heuristic objective is to maximize the number of posting lists in the non-requirement set (gbp-num): however, maximizing the number of posting lists in the non-requirement set does not necessarily cause more items to be skipped, because the posting lists can be very short. instead, we can maximize the total length of posting lists in the non-requirement set. in this case, the vector of ones in objective function (7) is replaced with posting list length vector l = l 1 , l 2 , . . . l |t| , where l i is the length of posting list i. we call this strategy gbp-len. the two gbp strategies are rank-safe since constraints in inequality (8) implies t∈skip w m,t ≤ θ. both strategies require solving binary programming problems, which are known to be np-complete and thus too intensive for long queries. instead, we greedily follow one branch of the binary programming sub-problems to obtain a feasible (but not optimal) solution in o(|t q ||t| 2 ). figure 3 illustrates formula query processing using a modified inverted index for dynamic pruning. for each internal node m of the query opt, we store the number of leaves of m as w m = |q (m) |. each query node points to tokenized path entries in a dictionary, where each reference is associated with w m,t = |q (m) t | identified by tokenized path t (denoted as m/w m of t). in fig. 3 , node q1 from the query has 6 leaves, which is also the upperbound number of path matches for q1, i.e, |q (1) |. since q1 consists of 2 tokenized leaf-root paths var/times/add and var/add, q1 is linked to two posting lists, each associated with a partial score upperbound (5 and 1). each posting list maps to a token path t ∈ t with a dynamic counter for the number of query nodes referring to it (initially |q t |). query nodes are pruned by our algorithm when its subtree width is no longer greater than the current threshold, because the corresponding subexpression cannot be in the top-k results. in this case the reference counter decreases. a posting list is removed if its reference counter is less than one. each posting list entry identified by an expid stores n and w n,t = |d (n) t | values of subtree token path t rooted at n (denoted as n/w n of t). as an example, in fig. 3 , the hit opt (of expid 12) has 5 paths tokenized as query processing is described in algorithm 1. requirementset returns selected iterators of the requirement set. assignment according to different pruning strategies is described in sect. 4. in the maxref strategy, we sort posting lists by descending maxref values, and take as many posting lists as possible into non-requirement set from the lowest maxref value. at merging, a candidate id is assigned by the minimal expid of current posting list iterators in the requirement set. requirement set iterators are advanced by one using the next() function, while iterators in the non-requirement set are advanced directly to the id equal to or greater than the current candidate by the skipto() function. in fig. 3 for example, the posting list corresponding to var/times/add is in the requirement set under the maxref strategy, while the other two are not: document expression 13 and 15 will be skipped if the next candidate is 90. for ease of testing termination, we append a special expid maxid at the end of each posting list, which is larger than any expid in the collection. at each iteration, a set of hitnodes is inferred containing query nodes associated with posting lists whose current expids are candidate id. 
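returning to the gbp-len strategy introduced above, the sketch below is one greedy reading of it: posting lists are added to the non-requirement set in decreasing length as long as, for every query node m, the summed partial upperbounds of the skipped lists do not exceed the threshold, which keeps pruning rank-safe. the decreasing-length ordering, the data layout, and the omission of the relaxation u(·) are assumptions of this sketch, not the paper's exact greedy procedure.

```python
def gbp_len(partial_ub, lengths, theta):
    """Greedy GBP-LEN sketch. partial_ub[m][t] = w_{m,t}; lengths[t] = posting list length."""
    skip = set()
    used = {m: 0 for m in partial_ub}                 # upperbound already "spent" per query node
    for t in sorted(lengths, key=lengths.get, reverse=True):
        if all(used[m] + partial_ub[m].get(t, 0) <= theta for m in partial_ub):
            skip.add(t)
            for m in partial_ub:
                used[m] += partial_ub[m].get(t, 0)
    return skip                                        # the remaining lists form the requirement set

ub = {"q1": {"var/times/add": 5, "var/add": 1}, "q2": {"var/times": 2}}
lens = {"var/times/add": 9, "var/add": 200, "var/times": 50}
print(gbp_len(ub, lens, theta=2))   # {'var/add', 'var/times'} can be skipped safely
```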
qrynode-match calculates matches for hit nodes according to eq. 5, pruning nodes whose maximum matching size is smaller than previously examined nodes. given query hit node q1 in fig. 3 , function qrynodematch returns max n∈t d ν(g (1,n) ) = max(min(5, 2) + min(1, 2), min(5, 3)) = 3 then the algorithm selects the best matched query node and its matched width (i.e., widest in algorithm 1) is our structural similarity w * q,d . after obtaining w * q,d , we compute a metric for the similarity of symbols (e.g., to differentiate e = mc 2 and y = ax 2 ) and penalize larger formulas, to produce a final overall similarity score [23] for ranking. because of this additional layer, we need to relax our upperbound further. according to the overall scoring function in [23] , our relaxing function u can be defined by assuming perfect symbol similarity score in overall scoring function, specifically where in our setting, parameters η = 0.05, n d = 1. whenever threshold θ is updated, we will examine all the query nodes, if a query node m has an upperbound less or equal to the threshold, i.e., u(w m ) ≤ θ, then the corresponding subtree of this node is too "small" to make it into top k results. as a result, some of the posting lists (or iterators) may also be dropped due to zero reference. for each m/wm of tokenized path t rooted at m do let i be the iterator index associated with t if heap := data structure to hold top k results while true do 20: candidate := minimal id in current expids of reqs if candidate equals maxid then search terminated, return results. return top k results let g(q, d, e) be the hit path set bipartite graph. if maxmatch > widest then widest := maxmatch find the widest width. if widest > 0 then score := calculate final score (including symbol similarity). see [23] . if heap is not full or score > θ then push candidate or replace the lowest scored hit in heap. if heap is full then update current threshold. θ := minimal score in current top k results drop small query nodes and unreferenced iterators. reqs := requirementset(θ, strategy) update requirement set. for iters[i] in reqs do advance posting list iterators. if iters[i].expid = candidate then iters[i].next() we first evaluate our system 2 on the ntcir-12 wikipedia formula browsing task [20] (ntcir-12 for short), which is the most current benchmark for formula-only retrieval. the dataset contains over 590,000 math expressions taken from english wikipedia. since work in formula retrieval is relatively new, there are only 40 queries in ntcir-12 that can be compared with other published systems. however, these queries are well designed to cover a variety of math expressions in different complexity. there are 20 queries containing wildcards in this task (using wildcard specifier \qvar to match arbitrary subexpression or symbols, e.g., query "\qvar{a} 2 + \qvar{b} 3 " can match "x 2 + (y + 1) 3 "). we add support for wildcards by simply treating internal nodes (representing a rooted subexpression) of formulas as additional "leaves" (by ignoring their descendants), and the wildcard specifiers in a query are treated as normal leaves to match those indexed wildcard paths. since the corpus of ntcir-12 is not large enough to show the full impact of pruning, we also evaluate query run times on a corpus containing over 1 million math related documents/threads from math stackexchange (mse) q&a website 3 and we run the same query set from ntcir-12. 
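a python restatement of the core document-at-a-time merge loop of algorithm 1 above: candidates are drawn from the requirement set only, non-requirement lists are advanced with skipto, and a top-k heap is maintained. the PostingList class, the maxid sentinel and the score callable are simplifications; the threshold updates, query-node pruning and requirement-set reassignment shown earlier are omitted for brevity.

```python
import heapq

MAXID = float("inf")                            # sentinel appended to every posting list

class PostingList:
    """Minimal in-memory posting list exposing the expid / next / skipTo interface."""
    def __init__(self, expids):
        self.entries, self.pos = list(expids) + [MAXID], 0
    def expid(self):
        return self.entries[self.pos]
    def next(self):
        self.pos += 1
    def skip_to(self, target):
        while self.entries[self.pos] < target:
            self.pos += 1

def merge(requirement, non_requirement, k, score):
    """DAAT merge: candidates come from the requirement set; other lists are only skipped to."""
    heap = []                                   # (score, expid) min-heap of the current top k
    while True:
        candidate = min(pl.expid() for pl in requirement)
        if candidate == MAXID:
            return sorted(heap, reverse=True)   # search terminated
        for pl in non_requirement:
            pl.skip_to(candidate)
        hit_lists = [pl for pl in requirement + non_requirement if pl.expid() == candidate]
        s = score(candidate, hit_lists)
        if len(heap) < k:
            heapq.heappush(heap, (s, candidate))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, candidate))
        for pl in requirement:
            if pl.expid() == candidate:
                pl.next()

req = [PostingList([1, 5, 9, 12])]
non = [PostingList([5, 12, 30])]
print(merge(req, non, k=2, score=lambda expid, hits: len(hits)))   # [(2, 12), (2, 5)]
```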
run times are shown for the posting list merging stage (e.g., time for parsing the query into an opt is excluded) and, unless specified, posting lists are compressed and cached in memory. each system had five independent runs, and we report results from the overall distribution. the resulting uncompressed index sizes for the ntcir-12 and mse corpora are around 2 gb and 16 gb, with 961,604 and 5,764,326 posting lists respectively. the (min, max, mean, standard deviation) of posting list lengths are (1, 262309, 16.95, 737.84) and (1, 7916296, 73.74, 9736.72). table 1 reports run time statistics. non-pruning (exhaustive search) baselines with k = 100 are also compared here. almost consistently, the gbp-len strategy achieves the best efficiency with smaller variance. this is expected since gbp-len models the skipping possibility better than gbp-num. although gbp-num gives a tighter theoretical upper bound than maxref, it only maximizes the number of posting lists in the non-requirement set and may lead to bad performance when these posting lists are short. in a few cases the best minimal run times come from other strategies; for those with meaningful gaps, i.e., on the wiki dataset with non-wildcard queries when k = 1000, maxref outperforms the others in standard deviation and maximum run time by a notable margin. however, this likely results from a small threshold due to the large k, so that efficiency on the small ntcir dataset is less affected by pruning (a small θ means less pruning potential) compared to the time complexity added by assigning the requirement set; the latter is more dominant in the gbp runs. in wildcard queries, however, many expressions can match the query, thus the threshold value is expected to be larger than in the non-wildcard case. secondly, we have compared our system's effectiveness (fig. 4) and efficiency (fig. 5) with tangent-s [5], mcat [11] and our baseline system without pruning [23], which are all structure-based formula search engines that have obtained the best published bpref scores on the ntcir-12 dataset. in addition, the icst system [7] also obtains effective results for the math and text mixed task, but it is trained on a previous wiki dataset and the system is currently not available. all systems are evaluated in a single thread for top-1000 results. we use our best-performing strategy, i.e., gbp-len, in an on-disk version with posting lists uncompressed and always read from disk, and an in-memory version with compression. for the baseline system, only the 20 non-wildcard queries are reported because it does not support wildcards. we compare the baseline's best-performing run (base-best), which uses costly multiple tree matching, as well as its specialized version (base-opd-only), which considers only the largest matched tree width (see eq. 2). tangent-s has a few outliers as a result of its costly alignment algorithm to rerank structure and find the maximum subtree similarity [22]; its non-linear complexity makes it expensive for some long queries (especially in the wildcard case). mcat reportedly has a median query execution time of around 25 s, using a server machine and multi-threading [11]. we therefore remove the tangent-s outliers and mcat from the runtime boxplot. for space, we only include the faster base-opd-only baseline in fig. 5. we outperform tangent-s in efficiency even if we exclude their outlier queries, with higher bpref in non-wildcard fully relevant results.
our efficiency is also better than the baseline system, even if the latter only considers less complex non-wildcard queries. however, our overall effectiveness is skewed by bad performance of wildcard queries because a much more expensive phase is introduced to boost accuracy by other systems to handle inherently difficult "structural wildcards." our pruning strategies are rank-safe (pruning and exhaustive version shows the same bpref scores) but there is a minor bpref difference between ours and baseline (base-opd-only) due to parser changes we have applied to support wildcards (e.g., handle single left brace array as seen in a wildcard query) and they happen to slightly improve accuracy in partially relevant cases. we have presented rank-safe dynamic pruning strategies that produce an upperbound estimation of structural similarity in order to speedup formula search using subtree matching. our dynamic pruning strategies and specialized inverted index are different from traditional linear text search pruning methods and they further associate query structure representation with posting lists. our results show we can obtain substantial improvement in efficiency over the baseline model, while still generating highly relevant non-wildcard search results. our approach can process a diverse set of structural queries in real time. pruned query evaluation using pre-computed impacts an experimental study of index compression and daat query processing methods efficient query evaluation using a two-level retrieval process retrieval evaluation with incomplete information layout and semantics: combining representations for mathematical formula search faster top-k document retrieval using block-max indexes the math retrieval system of icst for ntcir-12 mathir task efficient compressed inverted index skipping for disjunctive text-queries structural similarity search for mathematics retrieval tangent-v: math formula image search using line-of-sight graphs mcat math retrieval system for ntcir-12 mathir task a mathematics retrieval system for formulae in layout presentations upper-bound approximations for dynamic pruning technical aspects of the digital library of mathematical functions optimized top-k processing with global page scores on block-max indexes indexing and searching mathematics in digital libraries optimization strategies for complex queries efficient query processing for scalable web search query evaluation: strategies and optimizations ntcir-12 mathir task overview recognition and retrieval of mathematical expressions multi-stage math formula search: using appearance-based similarity metrics at scale structural similarity search for formulas using leaf-root paths in operator subtrees key: cord-020899-d6r4fr9r authors: doinychko, anastasiia; amini, massih-reza title: biconditional generative adversarial networks for multiview learning with missing views date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_53 sha: doc_id: 20899 cord_uid: d6r4fr9r in this paper, we present a conditional gan with two generators and a common discriminator for multiview learning problems where observations have two views, but one of them may be missing for some of the training samples. this is for example the case for multilingual collections where documents are not available in all languages. 
some studies tackled this problem by assuming the existence of view generation functions to approximately complete the missing views; for example machine translation to translate documents into the missing languages. these functions generally require an external resource to be set and their quality has a direct impact on the performance of the learned multiview classifier over the completed training set. our proposed approach addresses this problem by jointly learning the missing views and the multiview classifier using a tripartite game with two generators and a discriminator. each of the generators is associated to one of the views and tries to fool the discriminator by generating the other missing view conditionally on the corresponding observed view. the discriminator then tries to identify if for an observation, one of its views is completed by one of the generators or if both views are completed along with its class. our results on a subset of reuters rcv1/rcv2 collections show that the discriminator achieves significant classification performance; and that the generators learn the missing views with high quality without the need of any consequent external resource. we address the problem of multiview learning with generative adversarial networks (gans) in the case where some observations may have missing views without there being an external resource to complete them. this is a typical situation in many applications where different sources generate different views of samples unevenly; like text information present in all wikipedia pages while images are more scarce. another example is multilingual text classification where documents are available in two languages and share the same set of classes while some are just written in one language. previous works supposed the existence of view generating functions to complete the missing views before deploying a learning strategy [2] . however, the performance of the global multiview approach is biased by the quality of the generating functions which generally require external resources to be set. the challenge is hence to learn an efficient model from the multiple views of training data without relying on an extrinsic approach to generate altered views for samples that have missing ones. in this direction, gans provide a propitious and broad approach with a high ability to seize the underlying distribution of the data and create new samples [11] . these models have been mostly applied to image analysis and major advances have been made on generating realistic images with low variability [7, 15, 16] . gans take their origin from the game theory and are formulated as a two players game formed by a generator g and a discriminator d. the generator takes a noise z and produces a sample g(z) in the input space, on the other hand the discriminator determines whenever a sample comes from the true distribution of the data or if it is generated by g. other works included an inverse mapping from the input to the latent representation, mostly referred to as bigans, and showed the usefulness of the learned feature representation for auxiliary discriminant problems [8, 9] . this idea paved the way for the design of efficient approaches for generating coherent synthetic views of an input image [6, 14, 21] . in this work, we propose a gan based model for bilingual text classification, called cond 2 gans, where some training documents are just written in one language. 
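for reference, the two-player game of [11] sketched above is the classical minimax objective below; this is the standard formulation of gans, not notation introduced in this paper, and the conditional model proposed here additionally conditions each generator on the observed view.

$$\min_{G}\,\max_{D}\;\; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] \;+\; \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]$$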
the model learns the representation of missing versions of bilingual documents jointly with the association to their respective classes, and is composed of a discriminator d and two generators g 1 and g 2 formulated as a tripartite game. for a given document with a missing version in one language, the corresponding generator induces the latter conditionally on the observed one. the training of the generators is carried out by minimizing a regularized version of the cross-entropy measure proposed for multi-class classification with gans [19] in a way to force the models to generate views such that the completed bilingual documents will have high class assignments. at the same time, the discriminator learns the association between documents and their classes and distinguishes between observations that have their both views and those that got a completed view by one of the generators. this is achieved by minimizing an aggregated cross-entropy measure in a way to force the discriminator to be certain of the class of observations with their complete views and uncertain of the class of documents for which one of the versions was completed. the regularization term in the objectives of generators is derived from an adapted feature matching technique [17] which is an effective way for preventing from situations where the models become unstable; and which leads to fast convergence. we demonstrate that generated views allow to achieve state-of-the-art results on a subset of reuters rcv1/rcv2 collections compared to multiview approaches that rely on machine translation (mt) for translating documents into languages in which their versions do not exist; before training the models. importantly, we exhibit qualitatively that generated documents have meaningful translated words bearing similar ideas compared to the original ones; and that, without employing any large external parallel corpora to learn the translations as it would be the case if mt were used. more precisely, this work is the first to: -propose a new tripartite gan model that makes class prediction along with the generation of high quality document representations in different input spaces in the case where the corresponding versions are not observed (sect. 3.2); -achieve state-of-the art performance compared to multiview approaches that rely on external view generating functions on multilingual document classification; and which is another challenging application than image analysis which is the domain of choice for the design of new gan models (sect. 4.2); -demonstrate the value of the generated views within our approach compared to when they are generated using mt (sect. 4.2). multiview learning has been an active domain of research these past few years. many advances have been made on both theoretic and algorithmic sides [5, 12] . the three main families of techniques for (semi-)supervised learning are (kernel) canonical correlation analysis (cca), multiple kernel learning (mkl) and co-regularization. cca finds pairs of highly correlated subspaces between the views that is used for mapping the data before training, or integrated in the learning objective [3, 10] . mkl considers one kernel per view and different approaches have been proposed for their learning. in one of the earliest work, [4] proposed an efficient algorithm based on sequential minimization techniques for learning a corresponding support vector machine defined over a convex nonsmooth optimization problem. 
co-regularization techniques tend to minimize the disagreement between the single-view classifiers over their outputs on unlabeled examples by adding a regularization term to the objective function [18] . some approaches have also tackled the tedious question of combining the predictions of the view specific classifiers [20] . however all these techniques assume that views of a sample are complete and available during training and testing. recently, many other studies have considered the generation of multiple views from a single input image using gans [14, 21, 23] and have demonstrated the intriguing capacity of these models to generate coherent unseen views. the former approaches rely mostly on an encoder-encoder network to first map images into a latent space and then generate their views using an inverse mapping. this is a very exciting problem, however, our learning objective differs from these approaches as we are mostly interested in the classification of muti-view samples with missing views. the most similar work to ours that uses gans for multiview classification is probably [6] . this approach generates missing views of images in the same latent space than the input image, while cond 2 gans learns the representations of the missing views in their respective input spaces conditionally on the observed ones which in general are from other feature spaces. furthermore, cond 2 gans benefits from low complexity and stable convergence which has been shown to be lacking in the previous approach. another work which has considered multiview learning with incomplete views, also for document classification, is [2] . the authors proposed a rademacher complexity bounds for a multiview gibbs classifier trained on multilingual collections where the missing versions of documents have been generated by machine translation systems. their bounds exhibit a term corresponding to the quality of the mt system generating the views. the bottleneck is that mt systems depend on external resources, and they require a huge amount of parallel collections containing documents and their translations in all languages of interest for their tuning. for rare languages, this can ultimately affect the performance of the learning models. regarding these aspects our proposed approach differs from all the previous studies, as we do not suppose the existence of parallel training sets nor mt systems to generate the missing versions of the training observations. in the following sections, we first present the basic definitions which will serve to our problem setting, and then the proposed model for multiview classification with missing views. we consider multiclass classification problems, where a bilingual document is defined as a sequence x = (x 1 , x 2 ) ∈ x that belongs to one and only one class y ∈ y = {0, 1} k . the class membership indicator vector y = (y k ) 1≤k≤k , of each bilingual document, has all its components equal to 0 except the one that indicates the class associated with the example which is equal to one. here we suppose that x = ( following the conclusions of the co-training study [5] , our framework is based on the following main assumption: observed views are not completely correlated, and are equally informative. furthermore, we assume that each example (x, y) is identically and independently distributed (i.i.d.) according to a fixed yet unknown distribution d over x ×y, and that at least one of its views is observed. 
additionally, we suppose to have access to a training set denotes the subset of training samples with their both complete views and is the subset of training samples with their second (respectively first) view that is not observed (i.e. m = m f + m 1 + m 2 ). it is possible to address this problem using existing techniques; for example, by learning singleview classifiers independently on the examples of s s 1 (respectively s s 2 ) for the first (respectively second) view. to make predictions, one can then combine the outputs of the classifiers [20] if both views of a test example are observed; or otherwise, use one of the outputs corresponding to the observed view. another solution is to apply multiview approaches over the training samples of s f ; or over the whole training set s by completing the views of examples in s 1 and s 2 before using external view generation functions. as an alternative, the learning objective of our proposed approach is to generate the missing views of examples in s 1 and s 2 , jointly with the learning of the association between the multiview samples (with all their views complete or completed) and their classes. the proposed model consists of three neural networks that are trained using an objective implementing a three players game between a discriminator, d, and two generators, g 1 and g 2 . the game that these models play is depicted in fig. 1 and it can be summarized as follows. at each step, if an observation is chosen with a missing view, the corresponding generator -g 1 (respectively g 2 ) if the first (respectively second) view is missingproduces the view from random noise conditionally on the observed view in a way to fool the discriminator. on the other hand, the discriminator takes as input an observation with both of its views complete or completed and, classifies it if the views are initially observed or tells if a view was produced by one of the generators. formally, both generators g 1 and g 2 take as input; samples from the training subsets s 2 and s 1 respectively; as well as random noise drawn from a uniform distribution defined over the input space of the missing view and produce the corresponding pseudo-view, which is missing; i.e. g 1 (z 1 , x 2 ) =x 1 and g 2 (x 1 , z 2 ) =x 2 . these models are learned in a way to replicate the conditional distributions p(x 1 |x 2 , z 1 ) and p(x 2 |x 1 , z 2 ); and inherently define two probability distributions, denoted respectively by p g1 and p g2 , as the distribution of samples if both views where observed i.e. (x 1 , . on the other hand, the discriminator takes as input a training sample; either from the set s f , or from one of the training subsets s 1 or s 2 where the missing view of the example is generated by one of the generators accordingly. the task of d is then to recognize observations from s 1 and s 2 that have completed views by g 1 and g 2 and to classify examples from s to their true classes. to achieve this goal we add a fake class, k + 1, to the set of classes, y, corresponding to samples that have one of their views generated by g 1 or g 2 . the dimension of the discriminator's output is hence set to k + 1 which by applying softmax is supposed to estimate the posterior probability of classes for each multiview observation (with complete or completed views) given in input. for an observation x ∈ x , we use d k+1 (x) = p d (y = k + 1|x) to estimate the probability that one of its views is generated by g 1 or g 2 . 
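the shapes involved in this tripartite game can be illustrated with a small sketch; this is not the authors' code. the two-layer, 200-unit architecture with sigmoid hidden activations, the absence of an output activation on the generators, and the softmax over k+1 discriminator outputs follow the implementation details reported later in the paper, while the weight-passing interface is an assumption.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def two_layer(x, w1, b1, w2, b2):
    """two-layer net with a sigmoid hidden layer (200 units in the paper)."""
    return sigmoid(x @ w1 + b1) @ w2 + b2

# g1 completes the missing first view from noise z1, conditioned on the observed second view x2
def generator_1(z1, x2, params):
    return two_layer(np.concatenate([z1, x2], axis=-1), *params)   # x̃¹, no output activation

# g2 completes the missing second view from the observed first view x1 and noise z2
def generator_2(x1, z2, params):
    return two_layer(np.concatenate([x1, z2], axis=-1), *params)   # x̃²

# d maps a pair of (complete or completed) views to k+1 class probabilities;
# class k+1 is the "fake" class flagging a generator-completed view
def discriminator(x1, x2, params, k):
    probs = softmax(two_layer(np.concatenate([x1, x2], axis=-1), *params))
    assert probs.shape[-1] == k + 1
    return probs
```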
as the task of the generators is to produce good quality views such that an observation with a completed view will be assigned to its true class with high probability, we follow [17] by requiring that the discriminator should not get fooled easily, as stated in the following assumption: an observation x is considered to have one of its views generated by g_1 or g_2 in the case where d_{K+1}(x) > Σ_{k=1}^{K} d_k(x). in the case where d_{K+1}(x) ≤ Σ_{k=1}^{K} d_k(x), the observation x is supposed to have both of its views observed and it is assigned to one of the classes following the rule argmax_{k∈{1,...,K}} d_k(x). the overall learning objective of cond²gans is to train the generators to produce realistic views indistinguishable from the real ones, while the discriminator is trained to classify multiview observations having their complete views and to identify samples with a generated view. if we denote by p_real the marginal distribution of multiview observations with both views observed (i.e., (x¹, x²) ∼ p_real(x¹, x²)), the above procedure reduces to the following discriminator objective function v_d(d, g_1, g_2): in this way, we state a minmax game over the K+1 components of the discriminator. in addition to this objective, we make the generators also learn from the labels of completed samples; the corresponding objective for the generators is defined in eq. (2). note that, following assumption 1, we impose the generators to produce equally informative views by assigning the same weight to their corresponding terms in the objective functions (eqs. 1, 2). from the outputs of the discriminator for all x ∈ X we build an auxiliary function d(x) = Σ_{k=1}^{K} p_d(y = k | x), equal to the sum of the first K outputs associated to the true classes. in the following, we provide a theoretical analysis of cond²gans involving the auxiliary function d under nonparametric hypotheses. for fixed generators g_1 and g_2, the objective defined in eq. (1) leads to the optimal discriminator d*_{g_1,g_2} given in eq. (3), where p_{g_{1,2}}(x¹, x²) = ½ (p_{g_1}(x¹, x²) + p_{g_2}(x¹, x²)). proof. the proof follows from [11]. from assumption 2, and the fact that for any observation x the outputs of the discriminator sum to one, i.e., Σ_{k=1}^{K+1} d_k(x) = 1, the value function v_d can be rewritten accordingly. for any (α, β, γ) ∈ ℝ³ \ {(0, 0, 0)}, the function z ↦ α log z + (β/2) log(1 − z) + (γ/2) log(1 − z) reaches its maximum at z = α / (α + (β + γ)/2), which ends the proof, as the discriminator does not need to be defined outside the supports of p_data, p_{g_1} and p_{g_2}. by plugging d*_{g_1,g_2} (eq. 3) back into the value function v_d, we obtain the following necessary and sufficient condition for attaining the global minimum of this function: p_real = p_{g_{1,2}}; at this point, the minimum is equal to − log 4. proof. by plugging the expression of d* (eq. 3) back into the value function, which from the definitions of the kullback-leibler (kl) and the jensen-shannon divergence (jsd) can be rewritten in terms of jsd(p_real ‖ p_{g_{1,2}}): the jsd is always positive and jsd(p_real ‖ p_{g_{1,2}}) = 0 if and only if p_real = p_{g_{1,2}}, which ends the proof. from eq. 4, it is straightforward to verify that p_real(x¹, x²) = p_{g_{1,2}}(x¹, x²) is a global nash equilibrium, but it may not be unique. in order to ensure uniqueness, we add the jensen-shannon divergences between the distribution p_{g_1} and p_real and between p_{g_2} and p_real to the value function v_d (eq. 1), as stated in the corollary below: the global minimum is reached if and only if p_{g_1} = p_{g_2} = p_real, where v_d(d, g_1, g_2) is the value function defined in eq. (1) and jsd(p_{g_1} ‖ p_real) is the jensen-shannon divergence between the distribution p_{g_1} and p_real. proof.
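the decision rule stated above, comparing the fake-class output against the summed mass of the K real classes, can be written directly; this is a sketch, with classes reported 1-indexed as in the paper.

```python
import numpy as np

def decide(probs):
    """flag an observation as having a generated view when the fake-class mass
    exceeds the total mass of the K real classes; otherwise assign the most
    probable real class (1-indexed)."""
    K = probs.shape[-1] - 1
    fake_mass = probs[..., K]
    real_mass = probs[..., :K].sum(axis=-1)
    is_generated = fake_mass > real_mass
    predicted_class = probs[..., :K].argmax(axis=-1) + 1
    return is_generated, predicted_class

# usage on a single observation with K = 6 classes
p = np.array([0.05, 0.10, 0.08, 0.07, 0.05, 0.05, 0.60])
flag, cls = decide(p)   # flag == True: the discriminator thinks one view was generated
```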
the proof follows from the positivity of the jsd and the necessary and sufficient condition for it to be equal to 0. this result suggests that at equilibrium, both generators produce views such that observations with a completed view follow the same real distribution as those which have both of their views observed. in order to avoid the collapse of the generators [17], we perform minibatch discrimination by allowing the discriminator to have access to multiple samples in combination. from this perspective, the minmax game (eqs. 1, 2) is equivalent to the maximization of a cross-entropy loss, and we use minibatch training to learn the parameters of the three models. the corresponding empirical errors (the minibatch estimates of eqs. 1 and 2) are computed over a minibatch b that contains m_b samples from each of the sets s_f, s_1 and s_2. the overall training procedure can be summarized as follows:

    input: a training set s = s_f ∪ s_1 ∪ s_2
    initialization: size of minibatches m_b; use the xavier initializer to initialize the discriminator and generator parameters
    for each iteration:
        sample randomly a minibatch b_i of size 3·m_b from s_f, s_1 and s_2
        create minibatches of noise vectors z_1, z_2 drawn from u(−1, 1)
        update the discriminator and both generators (described in the text below)

in order to be in line with the premises of corollary 1, we empirically tested different solutions, and the most effective one that we found was the feature matching technique proposed in [17], which addresses the instability in the learning of the generators by adding a penalty term to their corresponding objectives (eq. (8)), where ‖·‖ is the ℓ2 norm and f is the sigmoid activation function on an intermediate layer of the discriminator. the overall algorithm of cond²gans is shown above. the parameters of the three neural networks are first initialized using xavier. for a given number of iterations t, minibatches of size 3·m_b are randomly sampled from the sets s_f, s_1 and s_2. minibatches of noise vectors are randomly drawn from the uniform distribution. the model parameters of the discriminator and both generators are then sequentially updated using the adam optimization algorithm [13]. we implemented our method with two-layer neural networks for each of the components of cond²gans. these networks have 200 nodes in their hidden layers with a sigmoid activation function. since the values of the generated samples are supposed to approximate any possible real value, we do not use an activation function on the outputs of the two generators. in this section, we present experimental results aimed at evaluating how the generation of views by cond²gans can help to take advantage of existing training examples, many of which have an incomplete view, in order to learn an efficient classification function. we perform experiments on a publicly available collection, extracted from reuters rcv1/rcv2, that is proposed for multilingual multiclass text categorization (table 1). the dataset contains numerical feature vectors of documents originally presented in five languages (en, fr, gr, it, sp). in our experiments, we consider four pairs of languages, always with english as one of the views ((en, fr), (en, sp), (en, it), (en, gr)). documents in different languages belong to one and only one class within the same set of classes (k = 6); and they also have translations into all the other languages. these translations are obtained from a state-of-the-art statistical machine translation system [22] trained over the europarl parallel collection using about 8 × 10⁶ sentences for the 4 considered pairs of languages.
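a minimal sketch of the minibatch machinery described above; the dataset arrays are assumptions, the adam updates themselves are omitted, and the exact penalty of eq. (8) is not shown in the text, so the function below uses the usual feature-matching form of [17]: the squared ℓ2 distance between mean intermediate-layer activations of the discriminator on real pairs and on generator-completed pairs.

```python
import numpy as np

def sample_minibatch(S_f, S_1, S_2, m_b, rng):
    """draw the 3*m_b minibatch used at each iteration: m_b examples from each of
    the fully observed set and the two partially observed sets (the sets are
    hypothetical row-indexed numpy arrays)."""
    pick = lambda S: S[rng.choice(len(S), size=m_b, replace=False)]
    return pick(S_f), pick(S_1), pick(S_2)

def noise(m_b, dim, rng):
    # noise vectors drawn from u(-1, 1), as in the procedure sketched above
    return rng.uniform(-1.0, 1.0, size=(m_b, dim))

def feature_matching_penalty(f_real, f_completed):
    """assumed feature-matching penalty: squared l2 distance between the mean
    intermediate-layer activations on real pairs and on completed pairs."""
    return float(np.sum((f_real.mean(axis=0) - f_completed.mean(axis=0)) ** 2))
```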
3 in our experiments, we consider the case where the number of training documents having their two versions is much smaller than those with only one of their available versions (i.e. m f m 1 + m 2 ). this corresponds to the case where the effort of gathering documents in different languages is much less than translating them from one language to another. to this end, we randomly select m f = 300 samples having their both views, m 1 = m 2 = 6000 samples with one of their views missing and the remaining samples without their translations for test. in order to evaluate the quality of generated views by cond 2 gans we considered two scenarios. in the first one (denoted by tenṽ), we test on english documents by considering the generation of these documents with respect to the other view (v ∈ {fr, gr, it, sp}) using the corresponding generator. in the second scenario (denoted by tẽ nv ), we test on documents that are written in another language than english by considering their generation on english provided by the other generator. for evaluation, we test the following four classification approaches along with cond 2 gans; one singleview approach and four multiview approaches. in the singleview approach (denoted by c v ) classifiers are the same as the discriminator and they are trained on the part of the training set with examples having their corresponding view observed. the multiview approaches are mkl [4] , coclassification (co-classif) [1] , unanimous vote ( mv b ) [2] . results are evaluated over the test set using the accuracy and the f 1 measure which is the harmonic average of precision and recall. the reported performance are averaged over 20 random (train/test) sets, and the parameters of adam optimization algorithm are set to α = 10 −4 , β = 0.5. on the value of the generated views. we start our evaluation by comparing the f 1 scores over the test set, obtained with cond 2 gans and a neural network having the same architecture than the discriminator d of cond 2 gans trained over the concatenated views of documents in the training set where the missing views are generated by machine translation. figure 2 shows these results. each point represents a class, where its abscissa (resp. ordinate) represents the test f 1 score of the neural network trained using mt (resp. one of the generators of cond 2 gans) to complete the missing views. all of the classes, in the different language pair scenarios, are above the line of equality, suggesting that the generated views by cond 2 gans provide higher value information than translations provided by mt for learning the neural network. this is an impressive finding, as the resources necessary for the training of mt is large (8.10 6 pairs of sentences and their translations); while cond 2 gans does both view completion and discrimination using only the available training data. this is mainly because both generators induce missing views with the same distribution than real pairs of views as stated in corollary 1. comparison between multiview approaches. we now examine the gains, in terms of accuracy, of learning the different multiview approaches on a collection where for other approaches than cond 2 gans the missing views are completed by one of the generators of our model. table 2 summarizes these results obtained by cond 2 gans, mkl, co-classif, and mv b for both test scenarios. in all cases cond 2 gans, provides significantly better results than other approaches. 
this provides empirical evidence of the effectiveness of the joint view generation and class prediction of cond 2 gans. furthermore, mkl, co-classif and cond 2 gans are binary classification models and tackle the multiclass classification case with one vs all strategy making them to suffer from class imbalance problem. results obtained using the f 1 measure are in line with those of table 2 and they are not reported for the sake of space. in this paper we presented cond 2 gans for multiview multiclass classification where observations may have missing views. the model consists of three neuralnetworks implementing a three players game between a discriminator and two generators. for an observation with a missing view, the corresponding generator produces the view conditionally on the other observed one. the discriminator is trained to recognize observations with a generated view from others having their views complete and to classify the latter into one of the existing classes. we evaluate the effectiveness of our approach on another challenging application than image analysis which is the domain of choice for the design of new gan models. our experiments on a subset of reuters rcv1/rcv2 show the effectiveness of cond 2 gans to generate high quality views allowing to achieve significantly better results, compared to the case where the missing views are generated by machine translation which requires a large collection of sentences and their translations to be tuned. as future study, we will be working on the generalization of the proposed model to more than 2 views. one possible direction is the use of an aggregation function of available views as a condition to the generators. a co-classification approach to learning from multilingual corpora learning from multiple partially observed views -an application to multilingual text categorization kernel independent component analysis multiple kernel learning, conic duality, and the smo algorithm combining labeled and unlabeled data with co-training multi-view generative adversarial networks deep generative image models using a laplacian pyramid of adversarial networks adversarial feature learning adversarially learned inference two view learning: svm-2k, theory and practice generative adversarial nets pac-bayesian analysis for a two-step hierarchical multiview learning approach adam: a method for stochastic optimization pose guided person image generation conditional image synthesis with auxiliary classifier gans unsupervised representation learning with deep convolutional generative adversarial networks improved techniques for training gans an rkhs for multi-view learning and manifold co-regularization unsupervised and semi-supervised learning with categorical generative adversarial networks a unified weight learning paradigm for multi-view learning cr-gan: learning complete representations for multi-view generation nrc's portage system for wmt multi-view image generation from a single-view key: cord-020834-ch0fg9rp authors: grand, adrien; muir, robert; ferenczi, jim; lin, jimmy title: from maxscore to block-max wand: the story of how lucene significantly improved query evaluation performance date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_3 sha: doc_id: 20834 cord_uid: ch0fg9rp the latest major release of lucene (version 8) in march 2019 incorporates block-max indexes and exploits the block-max variant of wand for query evaluation, which are innovations that originated from academia. 
this paper shares the story of how this came to be, which provides an interesting case study at the intersection of reproducibility and academic research achieving impact in the “real world”. we offer additional thoughts on the often idiosyncratic processes by which academic research makes its way into deployed solutions. we share the story of how an innovation that originated from academia-blockmax indexes and the corresponding block-max wand query evaluation algorithm of ding and suel [6] -made its way into the open-source lucene search library. this represents not only a case study in widespread reproducibility, since every recent deployment of lucene has access to these features and thus their performance benefits can be easily measured, but also of academic research achieving significant impact. how did these innovations make their way from the "ivory tower" into the "real world"? we recount the sequence of events, including false starts, that finally led to the inclusion of block-max wand in the latest major version of lucene (version 8), released in march 2019. we see this paper as having two main contributions beyond providing a narrative of events: first, we report results of experiments that attempt to match the original conditions of ding and suel [6] and present additional results on a number of standard academic ir test collections. these experiments characterize the performance of lucene's implementation and show the extent to which performance improvements are retained when moving from a research prototype to a production codebase. second, we offer a number of observations about the adoption of academic innovations, perhaps providing some insight into how academics might achieve greater real-world impact with their work. from its very beginnings in 1999, lucene has mostly existed in a "parallel universe" from academic ir researchers. part of this can be attributed to its "target audience": developers who wish to build real-world search applications, as opposed to researchers who wish to write papers. academic ir researchers have a long history of building and sharing search engines, dating back to the mid 1980s with cornell's smart system [4] . the tradition continues to this day, with lemur/indri [12, 13] and terrier [8, 14] being the most successful examples of open-source academic search engines, still popular with many researchers today. until recently, there has been little exchange between lucene and these systems, other than a few academic workshops [16, 21] . lucene has, for the longest time, been somewhat maligned in the academic ir community. for much of its existence, its default ranking model was a variant of tf-idf that was not only ad hoc, but demonstrably less effective than ranking models that were widely available in academic systems [18] . okapi bm25 was not added to lucene until 2011, 1 more than a decade after it gained widespread adoption in the research community; the consensus had long emerged that it was more effective than tf-idf variants. this lag has contributed to the broad perception by researchers that lucene produces poor search results and is illsuited for information retrieval research. this negative perception of lucene, however, began to change a few years ago. in 2015, an evaluation exercise known as the "open-source reproducibility challenge" [7] benchmarked seven open-source search engines and demonstrated that lucene was quite competitive in terms of both effectiveness and efficiency. 
it was the fourth fastest system (of seven) in terms of query evaluation, beating all the systems that were better than it in terms of effectiveness. since then, there has been a resurgence of interest in adopting lucene for information retrieval research, including a number of workshops that brought together like-minded researchers over the past few years [1, 2] . anserini [19, 20] is an open-source toolkit built on lucene that was specifically designed to support replicable information retrieval research by providing many research-oriented features missing from lucene, such as out-of-the-box support for a variety of common test collections. the project aims to better align ir researchers and practitioners, as lucene has become the de facto platform used in industry to build production search solutions (typically via systems such as elasticsearch and solr). the experiments in this paper were conducted with anserini. 3 from maxscore to block-max wand at berlin buzzwords in 2012, stefan pohl gave a presentation about max-score [17] to raise awareness about efficient retrieval techniques in the lucene community [15] . the presentation was accompanied by a working prototype. 2 this contribution was exciting but also challenging to integrate as it conflicted with some of the flexibility that lucene provides, requiring an index rewrite. there were ideas on how to address these issues, but they entailed a lot of effort, and so the issue remained stalled for about five years. five years is a long time and many changes occurred meanwhile. the switch from tf-idf to bm25 as lucene's default scoring function in 2015 created a natural upper bound on scores due to bm25's saturation effect, which made it possible to implement retrieval algorithms that reasoned about maximum scores without changes to lucene's index format. this led to an effort to implement a general-purpose wand [3] , based on a previous implementation for booleanquery. lucene received support for wand at the end of 2017 (although it wasn't released until version 8.0 with block-max indexes). implementing wand introduced two new issues. first, the total hit count would no longer be accurate, since not all matches are visited. common analytics use cases depend on this count, and many search engines display this value in their interfaces (see additional discussion in sect. 5). second, the fact that some lucene queries could produce negative scores became problematic, so lucene now requires positive scores. 3 support for block-max indexes was the final feature that was implemented, based on the developers' reading of the paper by ding and suel [6] , which required invasive changes to lucene's index format. note that the paper describes directly storing the maximum impact score per block, which fixes the scoring function at indexing time. to provide flexibility in being able to swap in different scoring functions, the lucene implementation stores all tf (term frequency) and dl (document length) pairs that might yield the maximum score. if we have one such pair (tf i , dl i ) then we can remove all other pairs (tf j , dl j ) where tf j ≤ tf i ∧ dl j ≥ dl l , since they are guaranteed to yield lower (or equal) scores-based on the assumption that scores increase monotonically with increasing tf and decreasing dl. this is implemented by accumulating all such pairs in a tree-like structure during the indexing process. these pairs are stored in skip lists, so the information is available to groups of 8, 64, 512, 4096, . . . 
blocks, allowing query evaluation to skip over more than one block at a time. an interesting coda to this story is that academic researchers were exploring alternatives to per-block impact scores circa 2017, for exactly the same reason (to allow the scoring model to be defined at search time). for example, macdonald and tonellotto [10] showed how to derive tight approximate upper bounds for block-max wand, based on work that dates back to 2011 [9] . similarly, the recently-released pisa research system stores flexible block-level metadata [11] . unfortunately, the lucene developers were not aware of these developments during their implementation. the journey from maxscore to block-max wand concluded in march 2019, with the rollout of all these features in the version 8.0 release of lucene. they are now the out-of-the-box defaults in the world's most popular search library. during the implementation of block-max wand, performance improvements were quantified in terms of lucene's internal benchmark suite, which showed a 3× to 7× improvement in query evaluation performance. as part of a formal reproducibility effort, we present experiments that attempt to match, to the extent practical, the original conditions described by ding and suel [6] . according to the paper, experiments were conducted on the gov2 web collection, on a randomly-selected subset of 1000 queries from the trec 2005 and 2006 efficiency tracks, which we were able to obtain from the authors. for their experiments, the inverted index was completely loaded into main memory and query evaluation latency was measured to retrieval depth ten. our experiments were conducted with the anserini ir toolkit, 4 comparing v0.5.1, which depends on lucene 7.6 and uses an optimized exhaustive or query evaluation strategy [5] with v0.6.0, which depends on lucene 8.0 and uses blockmax wand. we used anserini's standard regression test settings on the different collections, as described on its homepage. results represent averages over three trials on a warm cache. while the indexes were not explicitly loaded into memory, lucene benefits from caching at the os level. all experiments were conducted using a single thread on an otherwise idle server with dual intel xeon e5-2699 v4 processors and 1tb ram running rhel (release 7.7). results are shown in table 1 , where figures in the top three rows are copied from table 1 in the original paper. it is interesting that ding and suel report a much larger increase in performance comparing exhaustive or to bmw (18× on trec 2005 and 8× on trec 2006) than the comparable conditions in lucene (a more modest improvement of around 3×). this is due to a more optimized implementation of exhaustive or in lucene, which, for example, implements block processing [5] . interestingly, ding and suel report faster query evaluation in absolute terms, even on hardware that is much older: among the differences include c++ vs. java, as well as the simplicity of a research prototype vs. the realities of a fully-featured search library. beyond implementation differences, lucene must additionally compute the upper bound scores per block from the stored (tf, dl) pairs on the fly. we also report performance evaluations on two other standard test collections frequently used in academic information retrieval: clueweb09b and clueweb12-b13, with the same sets of queries. these results are shown in table 2 , where we report figures for different values of retrieval depth k, also averaged over three trials. 
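returning to the block-level metadata discussed above, a small sketch, not lucene's actual code, of (i) keeping only the (tf, dl) pairs that can yield the maximum score for a scoring function that grows with tf and shrinks with dl, and (ii) turning them into a block upper bound once the scoring function is chosen at search time; the bm25 parameters and statistics below are illustrative assumptions.

```python
def prune_pairs(pairs):
    """keep only (tf, dl) pairs not dominated by another pair, i.e. drop
    (tf_j, dl_j) when some other pair has tf >= tf_j and dl <= dl_j, since a
    monotonic score can never be maximized by a dominated pair."""
    pairs = sorted(set(pairs))
    return [(tf_j, dl_j) for tf_j, dl_j in pairs
            if not any(tf_i >= tf_j and dl_i <= dl_j and (tf_i, dl_i) != (tf_j, dl_j)
                       for tf_i, dl_i in pairs)]

def block_upper_bound(pairs, term_score):
    """the block-max value is the best score any stored pair can produce under
    the scoring function supplied at search time."""
    return max(term_score(tf, dl) for tf, dl in pairs)

# a bm25-style term score; idf, k1, b and avgdl values are assumptions
def bm25_term_score(tf, dl, idf=2.0, k1=1.2, b=0.75, avgdl=500.0):
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

# usage: (3, 40) dominates (2, 50) and (3, 60), so only non-dominated pairs survive
frontier = prune_pairs([(2, 50), (3, 40), (3, 60), (5, 300)])   # -> [(3, 40), (5, 300)]
upper = block_upper_bound(frontier, bm25_term_score)
```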
these numbers are consistent with fig. 7 in ding and suel's paper: performance of exhaustive or drops modestly as depth k increases, but bmw performance degrades much more quickly. this is exactly as expected. finally, we quantify the modest increase in indexing time due to the need to maintain (tf, dl) pairs in the inverted indexes, shown in table 3 (averaged over three trials, using 44 threads in all cases). these experiments used anserini's default regression settings on the respective collections, which builds full positional indexes and also stores the raw documents. the story of block-max wand in lucene provides a case study of how an innovation that originated in academia made its way into the world's most widely-used search library and achieved significant impact in the "real world" through hundreds of production deployments worldwide (if we consider the broader lucene ecosystem, which includes systems such as elasticsearch and solr). as there are very few such successful case studies (the other prominent one being the incorporation of bm25 in lucene), it is difficult to generalize these narratives into "lessons learned". however, here we attempt to offer a few observations about how academic research might achieve greater real-world impact. in short, block-max wand is in lucene because the developers learned about ding and suel and decided to reimplement it. this is somewhat stating the obvious, but this fateful decision highlights the idiosyncratic nature of technology adoption. we could imagine alternatives where the lucene developers had not come across the paper and developed a comparable solution in isolation, or they might have known about the paper and elected to take a different approach. in either case, the lucene solution would likely differ from block-max wand. this would be akin to convergent evolution in evolutionary biology, whereby different organisms independently evolve similar traits because they occupy similar environments. in such an "alternate reality", this paper would be comparing and contrasting different solutions to handling score outliers, not describing a reproducibility effort. to bring researchers and practitioners closer together, we recommend that the former be more proactive to "evangelize" their innovations, and the latter be more diligent in consulting the literature. eight years passed from the publication of the original paper (2011) until the release of lucene that included block-max wand (2019). the entire course of innovation was actually much longer if we trace the origins back to maxscore (1995) and wand (2003). one obvious question is: why did it take so long? there are many explanations, the most salient of which is the difference between a research prototype and a fully-featured search library that is already widely deployed. this decomposes into two related issues, the technical and the social. from a technical perspective, supporting bmw required invasive changes to lucene's index format and a host of related changes in scoring functionsfor example, scores could no longer be negative, and implementations could no longer access arbitrary fields (which was an api change). these had to be staged incrementally. concomitant with technical changes and backwards-compatibility constraints were a host of "social" changes, which required changing users' expectations about the behavior of the software. in short, bmw was not simply a drop-in replacement. for example, as discussed in sect. 
3, the hit count was no longer accurate, which required workarounds for applications that depended on the value. because such major changes can be somewhat painful, they need to be justified by the potential benefits. this means that only dramatic improvements really have any hope of adoption: multiple-fold, not marginal, performance gains. an interesting side effect is that entire generations of techniques might be skipped, in the case of lucene, directly from exhaustive or to bmw, leapfrogging intermediate innovations such as maxscore and wand. aiming to achieve real-world impact with academic research is a worthy goal, and we believe that this case study represents an endorsement of efforts to better align research prototypes with production systems, as exemplified by lucenebased projects like anserini. if academic researchers are able to look ahead "down the road" to see how their innovations might benefit end applications, the path from the "ivory tower" to the "real world" might become more smoothly paved. proceedings of the 40th annual international acm sigir conference on research and development in information retrieval (sigir 2017) lucene4ir: developing information retrieval evaluation resources using lucene efficient query evaluation using a two-level retrieval process implementation of the smart information retrieval system. department of computer science tr space optimizations for total ranking faster top-k document retrieval using block-max indexes toward reproducible baselines: the open-source ir reproducibility challenge from puppy to maturity: experiences in developing terrier upper-bound approximations for dynamic pruning upper bound approximation for blockmaxwand pisa: performant indexes and search for academia combining the language model and inference network approaches to retrieval indri at trec 2004: terabyte track terrier: a high performance and scalable information retrieval platform efficient scoring in lucene open source information retrieval: a report on the sigir 2012 workshop query evaluation: strategies and optimizations yet another comparison of lucene and indri performance anserini: enabling the use of lucene for information retrieval research anserini: reproducible ranking baselines using lucene sigir06 workshop report: open source information retrieval systems (osir06). in: sigir forum acknowledgments. this work was supported in part by the natural sciences and engineering research council (nserc) of canada. we'd like to thank craig macdonald, joel mackenzie, antonio mallia, and nicola tonellotto for helpful discussions on the intricacies of computing flexible per-block score bounds, and torsten suel for providing us with the original queries used in their evaluations. key: cord-020813-0wc23ixy authors: hashemi, helia; aliannejadi, mohammad; zamani, hamed; croft, w. bruce title: antique: a non-factoid question answering benchmark date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_21 sha: doc_id: 20813 cord_uid: 0wc23ixy considering the widespread use of mobile and voice search, answer passage retrieval for non-factoid questions plays a critical role in modern information retrieval systems. despite the importance of the task, the community still feels the significant lack of large-scale non-factoid question answering collections with real questions and comprehensive relevance judgments. in this paper, we develop and release a collection of 2,626 open-domain non-factoid questions from a diverse set of categories. 
the dataset, called antique, contains 34k manual relevance annotations. the questions were asked by real users in a community question answering service, i.e., yahoo! answers. relevance judgments for all the answers to each question were collected through crowdsourcing. to facilitate further research, we also include a brief analysis of the data as well as baseline results on both classical and neural ir models. with the rising popularity of information access through devices with small screens, e.g., smartphones, and voice-only interfaces, e.g., amazon's alexa and google home, there is a growing need to develop retrieval models that satisfy user information needs with sentence-level and passage-level answers. this has motivated researchers to study answer sentence and passage retrieval, in particular in response to non-factoid questions [1, 18] . non-factoid questions are defined as open-ended questions that require complex answers, like descriptions, opinions, or explanations, which are mostly passage-level texts. questions like "how to cook burgers?" are non-factoid. we believe this type of questions plays a pivotal role in the overall quality of question answering systems, since their technologies are not as mature as those for factoid questions, which seek precise facts, such as "at what age did rossini stop writing opera?". despite the widely-known importance of studying answer passage retrieval for non-factoid questions [1, 2, 8, 18] , the research progress for this task is limited by the availability of high-quality public data. some existing collections, e.g., [8, 13] , consist of few queries, which are not sufficient to train sophisticated machine learning models for the task. some others, e.g., [1] , significantly suffer from incomplete judgments. most recently, cohen et al. [3] developed a publicly available collection for non-factoid question answering with a few thousands questions, which is called wikipassageqa. although wikipassageqa is an invaluable contribution to the community, it does not cover all aspects of the non-factoid question answering task and has the following limitations: (i) it only contains an average of 1.7 relevant passages per question and does not cover many questions with multiple correct answers; (ii) it was created from the wikipedia website, containing only formal text; (iii) more importantly, the questions in the wikipassageqa dataset were generated by crowdworkers, which is different from the questions that users ask in real-world systems; (iv) the relevant passages in wikipassageqa contain the answer to the question in addition to some surrounding text. therefore, some parts of a relevant passage may not answer any aspects of the question; (v) it only provides binary relevance labels. to address these shortcomings, in this paper, we create a novel dataset for non-factoid question answering research, called antique, with a total of 2,626 questions. in more detail, we focus on the non-factoid questions that have been asked by users of yahoo! answers, a community question answering (cqa) service. non-factoid cqa data without relevance annotation has been previously used in [1] , however, as mentioned by the authors, it significantly suffers from incomplete judgments (see sect. 2 for more information on existing collections). we collected four-level relevance labels through a careful crowdsourcing procedure involving multiple iterations and several automatic and manual quality checks. 
note that we paid extra attention to collect reliable and comprehensive relevance judgments for the test set. therefore, we annotated the answers after conducting result pooling among several term-matching and neural retrieval models. in summary, antique provides annotations for 34,011 question-answer pairs, which is significantly larger than many comparable datasets. we further provide brief analysis to uncover the characteristics of antique. moreover, we conduct extensive experiments with antique to present benchmark results of various methods, including classical and neural ir models on the created dataset, demonstrating the unique challenges antique introduces to the community. to foster research in this area, we release antique. 1 factoid qa datasets. trec qa [14] and wikiqa [17] are examples of factoid qa datasets whose answers are typically brief and concise facts, such as named entities and numbers. insuranceqa [5] is another factoid dataset in the domain of insurance. antique, on the other hand, consists of open-domain non-factoid questions that require explanatory answers. the answers to these questions are often passage level, which is contrary to the factoid qa datasets. non-factoid qa datasets. there have been efforts for developing non-factoid question answering datasets [7, 8, 16] . keikha et al. [8] introduced the webap dataset, which is a non-factoid qa dataset with 82 queries. the questions and answers in webap were not generated by real users. there exist a number of datasets that partially contain non-factoid questions and were collected from cqa websites, such as yahoo! webscope l6, qatar living [9] , and stackexchange. these datasets are often restricted to a specific domain, suffer from incomplete judgments, and/or do not contain sufficient non-factoid questions for training sophisticated machine learning models. the nfl6 dataset [1] is a collection of non-factoid questions extracted from the yahoo! webscope l6. its main drawback is the absence of complete relevance annotation. previous work assumes that the only answer that the question writer has marked as correct is relevant, which is far from being realistic. that is why we aim to collect a complete set of relevance annotations. wikipassageqa is another non-factoid qa dataset that has been recently created by cohen et al. [3] . as mentioned in sect. 1, despite its great potentials, it has a number of limitations. antique addresses these limitations to provide a complementary benchmark for nonfactoid question answering (see sect. 1). more recently, microsoft has released the ms marco v2.1 passage re-ranking dataset [10] , containing a large number of queries sampled from the bing search engine. in addition to not being specific to non-factoid qa, it significantly suffers from incomplete judgments. in contrast, antique provides a reliable collection with complete relevance annotations for evaluating non-factoid qa models. following cohen et al. [1] , we used the publicly available dataset of non-factoid questions collected from the yahoo! webscope l6, called nfl6. we conducted the following steps for pre-processing and question sampling: (i) questions with less than 3 terms were omitted (excluding punctuation marks); (ii) questions with no best answer (â) were removed; (iii) duplicate or near-duplicate questions were removed. we calculated term overlap between questions and from the questions with more than 90% term overlap, we only kept one, randomly; (iv) we omitted the questions under the categories of "yahoo! 
products" and "computers & internet" since they are beyond the expertise of most workers; (v) from the remaining data, we randomly sampled 2,626 questions (out of 66,634). each question q in nfl6 corresponds to a list of answers named 'nbest answers', which we denote with a = {a 1 , . . . , a n }. for every question, one answer is marked by the question author on the community web site as the best answer, denoted byâ. it is important to note that as different people have different information needs, this answer is not necessarily the best answer to the question. also, many relevant answers have been added after the user has chosen the correct answer. nevertheless, in this work, we respect the users' explicit feedback, assuming that the candidates selected by the actual user are relevant to the query. therefore, we do not collect relevance assessments for those answers. we created a human intelligence task (hit) on amazon mechanical turk, in which we presented workers with a question-answer pair, and instructed them to annotate the answer with a label between 1 to 4. the instructions started with a short introduction to the task and its motivations, followed by detailed annotation guidelines. since workers needed background knowledge for answering the majority of the questions, we also includedâ in the instructions and called it a "possibly correct answer." in some cases, the question is very subjective and could have multiple correct answers. this is why it is called "possibly correct answer" to make it clear in the instructions that other answers could potentially be different from the provided answer, but still be correct. label definitions. to facilitate the labeling procedure, we described labels in the form of a flowchart to users. our aim was to preserve the notion of relevance in qa systems as we discriminate it with the typical topical relevance definition in ad-hoc retrieval tasks. the definition of each label is as follows: label 4: it looks reasonable and convincing. its quality is on par with or better than the "possibly correct answer". note that it does not have to provide the same answer as the "possibly correct answer". label 3: it can be an answer to the question, however, it is not sufficiently convincing. there should be an answer with much better quality for the question. label 2: it does not answer the question or if it does, it provides an unreasonable answer, however, it is not out of context. therefore, you cannot accept it as an answer to the question. label 1: it is completely out of context or does not make any sense. we included 15 diverse examples of annotated qa pairs with explanation of why and how the annotations were done. overall, we launched 7 assignment batches, appointing 3 workers to each qa pair. in cases where the workers could agree on a label (i.e., majority vote), we considered the label as the ground truth. we then added all qa pairs with no agreement to a new batch and performed a second round of annotation. it is interesting to note that the ratio of pairs with no agreement was nearly the same among the 7 batches (∼13%). in the very rare cases of no agreement after two rounds of annotation (776 pairs), an expert annotator decided on the final label. to allow further analysis, we have added a flag in the dataset identifying the answers annotated by the expert annotator. in total, the annotation task costed 2,400 usd. 
to ensure the quality of the data, we limited the hit to the workers with over 98% approval rate, who have completed at least 5,000 assignments. 3% of qa pairs were selected from a set of quality check questions with obviously objective labels. it enabled us to identify workers who did not provide high-quality labels. moreover, we recorded the click log of the workers to detect any abnormal behavior (e.g., employing automatic labeling scripts) that would affect the quality of the data. finally, we constantly performed manual quality checks by reading the qa pairs and their respective labels. the manual inspection was done on the 20% of each worker's submission as well as the qa pairs with no agreement. training set. in the training set, we annotate the list a (see sect. 3) for each query, and assume that for each question, answers to the other questions are irrelevant. as we removed similar questions from the dataset, this assumption is fair. to test this assumption, we sampled 100 questions from the filtered version of nfl6 and annotated the top 10 results retrieved by bm25 using the same crowdsourcing procedure. the results showed that only 13.7% of the documents (excluding a) were annotated as relevant (label 3 or 4). this error rate can be tolerated in the training process as it enables us to collect significantly larger amount of training labels. on the other hand, for the test set we performed pooling to label all possibly relevant answers. in total, the antique's training set contains 27,422 answer annotations as it shown in table 1 , that is 11.3 annotated candidate answers per training question, which is significantly larger than its similar datasets, e.g., wikipassageqa [3] . test set. the test set in antique consists of 200 questions which were randomly sampled from nfl6 after pre-processing and filtering. statistics of the test set can be found in table 1 . the set of candidate questions for annotation was selected by performing depth-k (k = 10) pooling. to do so, we considered the union of the top k results of various retrieval models, including term-matching and neural models (listed in table 2 ). we took the union of this set and "nbest answers" (set a) for annotation. here, we present a brief analysis of antique to highlight its characteristics. table 1 lists general statistics of antique. as we see, antique consists of 2,426 non-factoid questions that can be used for training, followed by 200 questions as a test set. furthermore, antique contains 27.4k and 6.5k annotations (judged answers) for the train and test sets, respectively. we also report the total number of answers with specific labels. workers performance. overall, we launched 7 crowdsourcing batches to collect antique. this allowed us to identify and ban less accurate workers. as reported in table 1 , a total number of 577 workers made over 148k annotations (257 per worker), out of which we rejected 12% because they failed to satisfy the quality criteria. questions distribution. figure 1 shows how questions are distributed in antique by reporting the top 40 starting trigrams of the questions. as shown in the figure, majority of the questions start with "how" and "why," constituting 38% and 36% of the questions, respectively. it is notable that, according to fig. 1 , a considerable number of questions start with "how do you," "how can you," "what do you," and "why do you," suggesting that their corresponding answers would be highly subjective and opinion based. 
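before continuing with the analysis of the question distribution, the sketch below illustrates how the test-set annotation pool described in this section (depth-10 pooling over several term-matching and neural retrieval runs, merged with the 'nbest answers' set a) could be assembled; the run format and names are assumptions.

    def build_annotation_pool(runs, nbest_answers, k=10):
        # runs: dict mapping a model name to {question_id: ranked list of answer ids}
        # nbest_answers: dict mapping question_id to the set a of 'nbest answers'
        pool = {}
        for model, rankings in runs.items():
            for qid, ranked in rankings.items():
                # depth-k pooling: union of the top-k answers of every model
                pool.setdefault(qid, set()).update(ranked[:k])
        for qid, answers in nbest_answers.items():
            # the 'nbest answers' (set a) are added to the pool for annotation
            pool.setdefault(qid, set()).update(answers)
        return pool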
also, we can see that a major fraction of questions start with "how can i" and "how do i", indicating the importance and dominance of personal questions. in fig. 2, we plot the distribution of the number of 'nbest answers' (|a|). we see that the majority of questions have 9 or fewer nbest answers (54%) and 82% of questions have 14 or fewer nbest answers. the distribution, however, has a long tail which is not shown in the figure. in this section, we provide benchmark results on the antique dataset. we report the results for a wide range of retrieval models in table 2. in this experiment, we report a wide range of standard precision- and recall-oriented retrieval metrics (see table 2). note that for the metrics that require binary labels (i.e., map, mrr, and p@k), we assume that the labels 3 and 4 are relevant, while 1 and 2 are non-relevant. due to the definition of our labels (see sect. 3), we recommend this setting for future work. for ndcg, we use the four-level relevance annotations (we mapped our 1 to 4 labels to 0 to 3). as shown in the table, the neural models significantly outperform bm25, an effective term-matching retrieval model. among all, bert [4] provides the best performance. recent work on passage retrieval also made similar observations [11, 12]. since map is a recall-oriented metric, the results suggest that all the models still fail at retrieving all relevant answers. there is still large room for improvement, in terms of both precision- and recall-oriented metrics. this paper introduced antique, a non-factoid question answering dataset. the questions in antique were sampled from a wide range of categories on yahoo! answers, a community question answering service. we collected four-level relevance annotations through multi-stage crowdsourcing as well as expert annotation. in summary, antique consists of 34,011 qa-pair relevance annotations for 2,426 and 200 questions in the training and test sets, respectively. additionally, we reported benchmark results for a set of retrieval models, ranging from term-matching to recent neural ranking models, on antique. our data analysis and retrieval experiments demonstrated that antique introduces unique challenges while fostering research for non-factoid question answering.
references:
[1] end to end long short term memory networks for nonfactoid question answering
[2] a hybrid embedding approach to noisy answer passage retrieval
[3] wikipassageqa: a benchmark collection for research on non-factoid answer passage retrieval
[4] bert: pre-training of deep bidirectional transformers for language understanding
[5] applying deep learning to answer selection: a study and an open task
[6] a deep relevance matching model for ad-hoc retrieval
[7] new collection announcement: focused retrieval over the web
[8] evaluating answer passages using summarization measures
[9] semeval-2017 task 3: community question answering
[10] ms marco: a human generated machine reading comprehension dataset
[11] passage re-ranking with bert
[12] investigating the successes and failures of bert for passage re-ranking
[13] evaluating and predicting answer quality in community qa
[14] what is the jeopardy model? a quasisynchronous grammar for qa
[15] anmm: ranking short answer texts with attention-based neural matching model
[16] beyond factoid qa: effective methods for non-factoid answer sentence retrieval
[17] wikiqa: a challenge dataset for open-domain question answering
[18] document summarization for answering non-factoid queries
acknowledgement.
this work was supported in part by the center for intelligent information retrieval and in part by nsf iis-1715095. any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor. key: cord-020891-lt3m8h41 authors: witschel, hans friedrich; riesen, kaspar; grether, loris title: kvgr: a graph-based interface for explorative sequential question answering on heterogeneous information sources date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_50 sha: doc_id: 20891 cord_uid: lt3m8h41 exploring a knowledge base is often an iterative process: initially vague information needs are refined by interaction. we propose a novel approach for such interaction that supports sequential question answering (sqa) on knowledge graphs. as opposed to previous work, we focus on exploratory settings, which we support with a visual representation of graph structures, helping users to better understand relationships. in addition, our approach keeps track of context – an important challenge in sqa – by allowing users to make their focus explicit via subgraph selection. our results show that the interaction principle is either understood immediately or picked up very quickly – and that the possibility of exploring the information space iteratively is appreciated. today's information repositories are numerous, diverse and often very large. there is an increasing demand for accessing and querying these repositories using questions posed in natural language. while there is a long history of research in the fields of question answering (over both structured and unstructured content) and natural language interfaces to databases (nlidb), as further elaborated in sect. 2, the field of (complex) sequential question answering [5, 14] is still rather new. possibly fuelled by the rise of chatbot technology and the resulting expectations of users, it claims that a more interactive approach to both fields will better meet user needs. its main assumption is that users do not simply ask a question to a knowledge base and then quit. instead, users tend to break down complex questions into a series of simple questions [5] . in addition, as known from exploratory search [12] , users who do not have a very clearly articulated information need and/or who aim at getting familiar with a new field of knowledge tend to ask series of questions where one answer triggers the next question. that is, a user might ask a rather "fuzzy" first question (such as "what are important topics in the field of 'information retrieval' ?") and then -when studying the answer -start to think of new questions, concerning some of the new concepts found in that answer. although the concept of exploratory search is well known from the field of information retrieval, this exploratory motivation for performing sequential question answering (over structured knowledge bases) has not been studied so far. in any case, sequential question answering raises the major challenge of keeping track of context: since they assume the context to be known from the prior questions and answers, users tend to leave away sentence elements [14] . especially in exploratory search settings, answers to fuzzy questions can be very complex, involving a large number of concepts and relations. 
hence, researchers have proposed various kinds of visualisations in order to aid users in grasping such complexity and studying relationships between concepts [1, 20] . in our work, we aim at building a context-aware sequential question answering system, especially suited for exploratory search. to this end, the solution is based on a knowledge graph -which integrates information from various structured and unstructured data sources, see sect. 3.1. since the visualization of graphs provides an intuitive overview of complex structures and relationships [2] , our system allows users to ask questions in natural language, but provides answers via a visual representation of subgraphs of the underlying knowledge graph. it supports both the user and the system in keeping track of the context/current focus of the search via a novel interaction concept that combines pointing/clicking and asking questions in natural language, described in sect. 3.2. we will show empirically that users appreciate the new interaction concept and its ability to define context and focus graphically, see sect. 4. both question answering and natural language interfaces to databases (nlidb, see [6] for a survey) have a long history. they share many characteristics since both support querying of knowledge bases using natural language. many question answering systems retrieve answers from textual (i.e. unstructured) resources, but there are also many approaches based on structured content, often in the form of ontologies [11] . in nlidb, many challenges have been addressed, e.g. making systems domain-independent [10] or overcoming specific difficulties with certain query languages, above all sql [22] . recent advances in this area are relying on sequence-to-sequence models [7, 17] , based on encoding and decoding of sequences via deep (reinforcement) learning. an obvious drawback of these supervised learning approaches -as opposed to earlier hand-crafted rule-based grammars -is the amount of training data required. although large hand-annotated datasets have been published [24, 25] , trained models cannot be expected to be fully domain-independent. while the fields of question answering (over structured data), semantic parsing and nlidb are obviously quite advanced, researchers have only recently begun to study the domain of "sequential question answering" (sqa). this new focus on interactive, dialog-driven access to knowledge bases is based on the insight that users rarely pose a question to such a knowledge base and then quit [3, 14] . instead, a more common and natural access pattern consists in posing a series of questions. most researchers in sqa assume that the motivation for dialogs comes from the need to decompose complex questions into simple ones [5, 14] . some researchers propose to perform such decomposition algorithmically [15] , while others provide evidence that it is more natural and realistic to assume that humans will like to perform this decomposition themselves, resulting in a series of simple, but inter-related questions [5] . a key challenge in any form of sequential or conversational question answering is the resolution of ellipses (e.g. omissions of arguments in relations) or anaphora which are very frequent in a dialogue where the user expects the system to keep track of the context [5, 9, 14] . these approaches all assume that a searcher always accesses a knowledge base with a clear question in mind. 
as outlined above, we advocate a wider perspective on sqa, including scenarios of an exploratory nature. in information retrieval, it has been thoroughly accepted that there exist situations in which users are unable to clearly articulate information needs, e.g. when trying to get acquainted with a new field where terminology is still unknown [12] . thus, users would like to explore, and often their questions become better articulated as they learn more about the new field. in order to support them in grasping relationships between new concepts in the -often very complex -answers to their fuzzy questions, ir researchers have proposed result set visualisations that provide a better overview than the typical ranked lists of document references [1, 20] . using visualisations, especially of graphs/ontologies as an output of retrieval systems has also been proposed, mainly in qa and nlidb that are based on knowledge graphs [2, 13, 23] . visualising graph query results is different from visualising graphs in general; the former resembles generation of results snippets in text retrieval [16] . however, we can learn and employ mechanisms from general approaches to analysing large graphs, e.g. by applying global ranking mechanisms (such as pagerank) or by summarizing properties of selected nodes [8] . as pointed out in [19] , visual graph analysis requires, besides the visual representation of graph structures, to have good interaction mechanisms and algorithmic analysis, such as aggregation/merging of nodes, identification of certain graph structures (such as cliques) or node ranking mechanisms such as pagerank. additional challenges originate in the fuzziness of natural language and the potential resulting number of (partially) matching result graphs. graph summarization approaches have been proposed as a solution [21, 23] -where summarized/aggregated graph structures play the role of snippets. another approach [4] uses result previews to narrow down result sets via "early" user interaction. while approaches to semantic parsing, nlidb and question answering over structured data are well studied, there is a recent rise in interest in better studying and supporting the interaction in sequential question answering (sqa) scenarios. however, the emerging field of sqa lacks -in our opinion -a clear idea of why users want to engage in a conversation. we claim that one important motivation can be found in exploratory settings where users need to first gain insights by interacting with a knowledge base, before being able to ask the "right" questions. another challenge in sqa is keeping track of context: in their survey on semantic parsing, kamath & das [6] mention "adding human in the loop for query refinement" as a promising future research direction in cases where the system is uncertain in its predictions. our contribution consists mainly in proposing a new interaction paradigm which allows users to ask questions in natural language and to receive answers in the form of visualised subgraphs of a knowledge graph. users can then interact with that subgraph to define the focus of their further research, before asking the next question. with this human involvement, we can show empirically both how the human benefits from clarifying the search direction while exploring the knowledge graph and how the machine is supported in understanding incomplete questions better because their context is made explicit. we further use a robust query relaxation approach to trade precision for recall when recall is low. 
our approach is domain-independent and does not require training data -it only requires a specification of node type names and their possible synonyms. it can be seen as a "traditional" and simple grammar-based approach -the focus is not on sophisticated semantic parsing (we might add e.g. sequence-to-sequence models later), but on the interactive process of graph exploration via natural language. the knowledge graph underlying our experiments was constructed out of a collection of heterogeneous sources and stored in a neo4j graph database 1 . for our experiments, we chose books as a domain and aimed at retrieving all information -from various sources -which users (leisure-time readers, students, ...) might find relevant, ranging from core bibliographic information, over authorrelated information (affiliation/prizes won) to reviews and social media coverage of books. to populate it, we implemented a collection of parsers for a variety of data sources. 2 : -for structured data, we built an xml parser (which can be applied to structured xml databases, but also for semi-structured xml files) and an rdf parser. the xml parser was used to integrate a sample of data from the bibliographic platform ipegma 3 , while the rdf parser was applied to the dbpedia sparql endpoint 4 to retrieve data about books, persons, their institutes and awards. the ipegma data covers mostly german books while dbpedia data is focused on english books. -in terms of semi-structured data, our html parser can process web content and a special twitter parser deals with tweets (and uses the html parser to process web pages linked from tweets). we applied the html parser to the websites literaturkritik.de and www.complete-review.com to retrieve book reviews and related book metadata in german and english. the twitter parser was applied to a collection of twitter accounts of major publishers whose timelines were analysed for tweets referring to books. -we also integrated a sentiment analysis service (aylien text api 5 ) as a typical example of analysis of the unstructured part of webpages, i.e. the plain text. in our case, we applied the service to the book reviews from literaturkritik.de to find out whether reviews were positive or negative. for www. complete-review.com, this information could be parsed directly from the web page. in neo4j, it is not required to define a schema (i.e. node or relation types) before inserting nodes or relationships. we used this property heavily: each parser has a configuration file in which one can define node and relation types to be extracted. we have developed a special syntax with which one can define the patterns to be searched within the various data sources to retrieve the corresponding data. this means that parsers can be extended to find new types of nodes and relationships and/or cover new data sources of known type, without the need to modify the program code of the parser. typically, the specifications for various data sources have overlapping node types, thus resulting in a data integration task. in order to match identical nodes (e.g. the same book) found in different data sources, the definitions also specify a "uniquneness attribute" (similar to a primary key in relational databases). as a result, the knowledge base consists of a single integrated graph. we have chosen a graph database because graphs are a very natural way of modeling relationships and are easy to visualise and interact with [2] . as laid out in sect. 
2, most previous work sees sequential question answering as a conversation in which complex questions are broken down into simpler ones. for instance, iyyer et al. [5] assume that users have already at the initial state of a conversation a complex question in mind -which they then decompose into simpler ones. in contrast, our new interaction concept aims at supporting scenarios that are more exploratory in nature (cf. exploratory search in text retrieval [12] ). in such settings, users often ask series of questions that emerge one from another -i.e. the answer to a first question triggers the next one etc. -without the final goal of such a conversation being clear initially. we propose a novel interaction mechanism for such an exploratory "conversation", where questions are posed in natural language, but answers are given in the form of subgraph visualisations, with a possibility to interact and select parts of subgraphs for further exploration (again via asking questions). note that it does not play a role whether a user starts from general concepts to "zoom in" to more specific ones or vice versa. in exploratory search, it is typical that -since the nature of the problem is unclear to the user -queries are imprecise or "tentative" [20] . this implies very often that the answers -much more than the questions or queries -can be quite complex. as pointed out in [1] , systems that support exploration hence often offer visualisation of search results as well as interaction mechanisms for further exploration. in our case, results are (possibly large) subgraphs of a given knowledge graph. by studying such a subgraph and interacting with it, a user may learn about important concepts and relations in a domain -and this leads to asking the next question(s). a next question may aim at either filtering the current subgraph or further broadening the scope by expanding a subgraph region with further related nodes. the design of our interaction concept was informed by a questionnaire which was filled out by a sample of 16 students. participants received a description of a situation (e.g. having read a good book) and were asked to formulate some questions that they would have in such a situation. we analysed their answers, looking for common patterns of questions and expected result sets. our resulting interaction concept is very simple: based on an initial keyword search or question in step 0, a user finds an entry point into the graph, i.e. an initial subgraph g 0 . from this point on -provided that the user would like to continue the current session -there are two main possibilities for exploration in each step i: 1. use the graphical user interface, e.g. expand the subgraph g i by unhiding all nodes related to a chosen node. 2. select a node or a set of nodes n gi as a "context" and ask a question about it. selection can be done (a) directly via one or more clicks on nodes or (b) by selecting all nodes of a certain type via a button. each interaction leads to a new graph g i+1 . while option 1 is not new, option 2 can lead to a new form of sequential question answering, with questions being asked in natural language and answers given as visualisations of subgraphs. this combination is user-friendly since on the one hand -as a basis of all nlidb research and conversational interfacesnatural language is the most natural form of expressing information needs. 
on the other hand, researchers in both information retrieval [1] and graph querying [23] communities use visualisations for improving the user-friendliness of exploratory search. in addition, we claim (and will later show empirically) that, while it is not natural for users to repeat entity names from an earlier question, it is rather natural for them to select preliminary results and thus make context explicit. we will show that such selection is even often helpful for their own understanding of how a question-answer-sequence develops and what they have learned so far/what they want to learn next. since the user specifies the context explicitly when using option 2, it is easy for our system to fill in missing parts of questions by simply assuming that they originate from that context. figure 1 illustrates the interaction concept with a small "exploration session" from the book domain (see sect. 3.1). in short, the session consists in a user searching for an author, then demanding to see all books from that author and finally asking which of these books have positive reviews. note how the visualisation of the result graph helps her to get a quick overview of complex structures -for instance to see at a glance which books have many vs. few positive reviews (yellow nodes) in the last result. in order to realise the interaction described in the previous section, kvgr builds several components on top of the knowledge graph (see sect. 3.1). all of these components are visible on the user interface, the numbers in fig. 2 refer to the corresponding (backend) components in the following enumeration: 1. fielded keyword search: each node in the knowledge graph is treated as a document and its (textual) attributes as fields. field weights are domainspecific -in the book domain the "title" field of books will have a higher weight than e.g. the "genre" field. the number of shown nodes is limited by applying a cut-off to node scores. 2. semantic parser, see sect. 3.4 3. graph visualisation and interaction, allowing common basic graph interactions, plus selecting a context, see sect. 3.2. since semantic parsing is not the core contribution of our work, we have built a simple, but robust grammar for parsing. it takes advantage of the interaction concept and the basic principles of graphs, but makes no further assumptions about the graph schema -it can be adapted easily to new domains simply by providing a lexicon of node types (see below). the grammar consists of jape rules in gate 6 , which annotate occurrences of graph nodes in user utterances, based on a simple lookup mechanism using a lexicon with manually maintained synonyms. each annotation is associated with a number of features, see fig. 3 . the annnotated questions are then passed to a cypher generator, which simply takes all nodes found in an utterance and generates a relationship pattern that is matched against the graph. we illustrate our parser with the example shown in fig. 3 . the parts of the question recognised as nodes are put in bold font, their extracted features are presented in the box above. the grammar has marked "journal" as a return node type and "it" as referring to a current user selection ("this=true"). here, the interaction concept is exploited: because the user has selected a book (let us assume, the book with id 629025), the system can assume that the pronoun "it" refers to that current selection (the same would apply to a phrase like "this book"). 
this information is enough for the cypher generator to generate a cypher query that matches journal nodes directly related to the selected book and returns them. this query, however, will not retrieve anything since the question contains an ellipsis: it should actually be formulated as "which journals have published a review about it?". that is, the system needs to extend the pattern to allow an intermediate node type related to both the current selection and the return type nodes. to this end, we have implemented a query relaxation mechanism which will first try out the strict cypher query and then - if nothing is returned - will relax it by allowing an intermediate, untyped node in the match pattern (a sketch is given below). the system does not know/specify that the intermediate node z is of type review - thus a negative impact on retrieval precision might result, which we trade for recall here. in order to evaluate our main hypothesis - namely that our new interaction mechanism effectively supports users in iteratively refining an exploratory search - we performed user tests in an exploratory search scenario. to make the sessions more comparable, we pre-defined the information needs: the "story" started with a keyword search for the topic "criminal law" and was continued with some typical questions about e.g. prominent authors in that field, authors who had won prizes, their institutes, as well as books with positive reviews in that field. before each session, participants were instructed about the features of the system via a short demo. within the session, the predefined information needs were explained and users were asked to interact with the system to satisfy them. when users got stuck with interaction or query formulation, help was offered. following the popular "five-user assumption" of usability testing [18], we recruited 5 participants: 2 colleagues from our school of business and 3 of our students. none of the subjects was previously aware of our project. this selection was made for practical feasibility reasons - we are aware of the bias, in terms of user characteristics, that it introduces. participants received overall 5 different information needs (q 1 to q 5 ). the first one (q 1 ) started from a single node (the topic "criminal law"), i.e. a context selection was not required. all subsequent ones required participants to select a subset of the nodes that were currently displayed (e.g. all books or all persons). the last information need (q 5 ) was formulated in a complex way ("which authors that have written a book about criminal law have also written a review?") and required participants to recognise that a partial result to the question was already available from a previous step. we observed the participants' difficulties in (a) formulating queries that the semantic parser would understand correctly, (b) grasping the principle of breaking down complex information needs into simpler ones (here, participants would typically try to extend the previous query by adding more constraints) and (c) remembering to select a subset of nodes as a context for their next query. table 1 shows the number of participants facing these problems for each of the test queries. in terms of query reformulation, there is no clear pattern - we observed a number of ways in which our grammar can be improved. grasping the process of iterative refinement shows a clear learning curve: while two participants had understood the principle immediately from the introductory demo, the other three needed only one experience with q 2 to grasp it.
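the generate-then-relax mechanism announced above can be illustrated with the following python sketch; the cypher patterns, the anonymous relationship wildcards and the property used to identify the selected node are assumptions for illustration, not the system's actual generator output.

    def generate_queries(return_type):
        # strict pattern: return-type nodes directly related to the current selection
        strict = ("MATCH (n)--(m:%s) WHERE n.id = $nid "
                  "RETURN DISTINCT m" % return_type)
        # relaxed pattern: allow one intermediate node z of unspecified type (the ellipsis),
        # trading precision for recall
        relaxed = ("MATCH (n)--(z)--(m:%s) WHERE n.id = $nid "
                   "RETURN DISTINCT m" % return_type)
        return strict, relaxed

    def answer(session, selected_node_id, return_type):
        # session: a neo4j driver session; try the strict query first, relax only if it is empty
        strict, relaxed = generate_queries(return_type)
        records = list(session.run(strict, nid=selected_node_id))
        if not records:
            records = list(session.run(relaxed, nid=selected_node_id))
        return records

for the example of fig. 3, return_type would be "Journal" and selected_node_id the id of the selected book (629025); the relaxed pattern then reaches journals that are connected to the book through an unspecified intermediate node z, such as a review.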
we observed that the problems with q5 resulted merely from participants not accurately understanding the complex question -they both said that it would have been different if it had been their own information need. remembering to select a subset of nodes as a context was harder: while two participants never forgot to do it, one needed q 2 , another one q 2 and q 3 to remember it; one participant could not get used to it until the end of the test. the persons who struggled expressed their expectation that -if they did not select any nodes, but asked a question like "which of these persons..."the system should automatically assume that it referred to all currently visible persons. since this is easy to build into our system, we can conclude that context selection will not be an issue once the principle of iterative refinement has been grasped. besides observing the query formulation and interaction strategies of the users -including their need for help -we asked the users to give us feedback on the following points: -intuitiveness of context selection: three participants stated that they found it intuitive and natural to select a context for their query and to break down complex questions. the other two expressed their expectation for the system to identify context automatically (see above). -results of elliptic queries: queries containing "intermediate nodes", e.g. a query "show me all authors who have written about criminal law" would show not only authors, but also their books, although the question did not ask for books. only one participant had difficulties in understanding what was shown (because the legend was not clear to him). when judging the result, 4 participants said that seeing the books was interesting, especially for someone wishing to explore criminal law as a new area, while 3 participants remarked that the result was not strictly what they had asked for. two participants stated that they would appreciate to see a list of persons -in addition to the graph visualisation. -general feedback on the interaction was very positive. despite the observed difficulties that did occur with query formulation, all participants said that they were impressed with the ability of the system to understand queries in natural language. four participants mentioned explicitly that the visual representation helped them to better understand relationships and to see "how things belong together". one participant said that it sparked his curiosity to explore further. all participants stated that the interaction mechanism was either "intuitive" or at least "easy to learn" (because, as they stated, "the effect of what you do is always visible") and three of them mentioned expressly that they liked the refinement process of breaking down complex queries. participants also came forth with a number of suggestions for improvement: two participants stated that they would appreciate if the system could understand -besides fully formulated questions -keyword-based inputs. the same participants and a third one expressed their wish to have result lists, in addition to a graph. the main reason mentioned for this was the lack of a ranking provided in the graph. the participants said that they would not know where to start looking if a result graph grew too large. 
-comparison to traditional interfaces, especially ones with list-based result presentation: participants said that our system would be more effective in supporting "detailed investigation" that required to "understand relationships", whereas traditional list-based systems would be better suited to get an overview of e.g. the most important books on criminal law because of their clear ranking. in this work, we have proposed a novel context-aware sequential question answering system, especially suited for exploratory search, based on graph visualisation for result presentation and iterative refinement of information needs. this refinement in turn is based on the selection of subsets of nodes for context definition and natural language questions towards this context. our results are somewhat limited by the specific scenario and use case that we explored and the small user group involved. however, they do show quite clearly that users either understand the principle immediately or pick it up very quickly -and that they appreciate the possibility of exploring the information space iteratively. having to explicitly select context is hard to get used to for some, and should be automated. the visual representation of results was well received for its support of understanding relationships. on the other hand, it became clear that ranking or highlighting the more "relevant" nodes will be needed to help users focus, especially when results get larger. thus, our main goal for future work will be to investigate the best way to incorporate node scoring into the system -either visually (e.g. via node sizes) or by providing ranked result lists in addition to and linked to the graph. because of the limitations of our participant selection strategy, further test with a more varied user group will also be required. finally, it might be interesting to explore the possibility for users to combine search results (sub-graphs) of queries before exploring the combined results further. 
references:
[1] adaptive visualization for exploratory information retrieval
[2] graph querying meets hci: state of the art and future directions
[3] dialog-to-action: conversational question answering over a large-scale knowledge base
[4] quble: towards blending interactive visual subgraph search queries on large networks
[5] search-based neural structured learning for sequential question answering
[6] a survey on semantic parsing
[7] deep reinforcement learning for sequence to sequence models
[8] perseus: an interactive large-scale graph mining and visualization tool
[9] incomplete follow-up question resolution using retrieval based sequence to sequence learning
[10] how to make a natural language interface to query databases accessible to everyone: an example
[11] is question answering fit for the semantic web?: a survey
[12] exploratory search: from finding to understanding
[13] efficient processing of keyword queries over graph databases for finding effective answers
[14] complex sequential question answering: towards learning to converse over linked question answer pairs with a knowledge graph
[15] the web as a knowledge-base for answering complex questions
[16] fast generation of result snippets in web search
[17] an end-to-end neural natural language interface for databases
[18] refining the test phase of usability evaluation: how many subjects is enough
[19] visual analysis of large graphs: state-of-the-art and future research challenges
[20] exploratory search interfaces: categorization
[21] summarizing answer graphs induced by keyword queries
[22] sqlnet: generating structured queries from natural language without reinforcement learning
[23] slq: a user-friendly graph querying system
[24] spider: a large-scale human-labeled dataset for complex and crossdomain semantic parsing and text-to-sql task
[25] seq2sql: generating structured queries from natural language using reinforcement learning
key: cord-020927-89c7rijg authors: zhuang, shengyao; zuccon, guido title: counterfactual online learning to rank date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_28 sha: doc_id: 20927 cord_uid: 89c7rijg
exploiting users' implicit feedback, such as clicks, to learn rankers is attractive as it does not require editorial labelling effort, and adapts to users' changing preferences, among other benefits. however, directly learning a ranker from implicit data is challenging, as users' implicit feedback usually contains bias (e.g., position bias, selection bias) and noise (e.g., clicking on irrelevant but attractive snippets, adversarial clicks). two main methods have arisen for optimizing rankers based on implicit feedback: counterfactual learning to rank (cltr), which learns a ranker from the historical click-through data collected from a deployed, logging ranker; and online learning to rank (oltr), where a ranker is updated by recording user interaction with a result list produced by multiple rankers (usually via interleaving). in this paper, we propose a counterfactual online learning to rank algorithm (coltr) that combines the key components of both cltr and oltr. it does so by replacing the online evaluation required by traditional oltr methods with the counterfactual evaluation common in cltr. compared to traditional oltr approaches based on interleaving, coltr can evaluate a large number of candidate rankers in a more efficient manner. our empirical results show that coltr significantly outperforms traditional oltr methods. furthermore, coltr can reach the same effectiveness as the current state-of-the-art, under noisy click settings, and has room for future extensions.
traditional learning to rank (ltr) requires labelled data to permit the learning of a ranker: that is, a training dataset with relevance assessments for every querydocument pair is required. the acquisition of such labelled datasets presents a number of drawbacks: they are expensive to construct [5, 25] , there may be ethical issues in privacy-sensitive tasks like email search [37] , and they cannot capture changes in user's preferences [19] . the reliance on users implicit feedbacks such as clicks is an attractive alternative to the construction of editorially labelled datasets, as this data does not present the aforementioned limitations [15] . however, this does not come without its own drawbacks and challenges. user implicit feedback cannot be directly treated as (pure) relevance labels because it presents a number of biases, and part of this implicit user signal may actually be noise. for example, in web search, users often examine the search engine result page (serp) from top to bottom. thus, higher ranked documents have a higher probability to be examined, attracting more clicks (position bias), which in turn may infer these results as relevant even when they are not [7, 18, 24] . other types of biases may affect this implicit feedback including selection and presentation bias [2, 16, 40] . in addition, clicks on serp items may be due to noise, e.g., sometimes users may click for unexpected reasons (e.g., clickbaits and serendipity), and these noisy clicks may hurt the learnt ranker. hence, in order to leverage the benefits of implicit feedback, ltr algorithms have to be robust to these biases and noises. there are two main categories of approaches to learning a ranker from implicit feedback [14] : (1) offline ltr: methods in this category learn a ranker using historical clickthrough log data collected from a production system (logging ranker). a representative method in this category is counterfactual learning to rank (cltr) [18] , where a user's observation probability (known as propensity) is adopted to construct an unbiased estimator which is used as the objective function to train the ranker. (2) online ltr (oltr): methods in this category interactively optimize a ranker given the current user's interactions. a representative method in this category is dueling bandit gradient descent (dbgd) [39] , where multiple rankers are used to produce an interleaved 1 results list to display to the user and collect clicks. this signal is used to unbiasedly indicate which rankers that participated in the interleaving process are better (online evaluation) and to trigger an update of the ranker in production. the aim of the counterfactual and the online evaluations is similar: they both attempt to unbiasedly evaluate the effectiveness of a ranker and thus can provide ltr algorithms with reliable updating information. in this paper, we introduce counterfactual online learning to rank (coltr), the first online ltr algorithm that combines the key aspects of both cltr and oltr approaches to obtain an effective ranker that can learn online from user feedback. coltr uses the dbgd framework from oltr to interactively update the ranker used in production, but it uses the counterfactual evaluation mechanism of cltr in place of online evaluation. the main challenge we address is that counterfactual evaluation cannot be directly used in online learning settings because the propensity model is unknown. 
this is resolved by mirroring solutions developed for learning in the bandit feedback problem (and specifically the self-normalized estimator [34] ) within the considered ranking task -this provides a position-unbiased evaluation of rankers. our empirical results show that cotlr significantly improves the traditional dbgd baseline algorithm. in addition, because cotlr does not require interleaving or multileaving, which is the most computationally expensive part in online evaluation [28] , coltr is more efficient than dbgd. we also find that coltr performance is at par with the current state-of-the-art oltr method [22] under noisy click settings, while presenting a number of avenues for further improvement. the goal of counterfactual learning to rank (cltr) is to learn a ranker from historical user interaction logs obtained with the ranker used in production. an advantage of this approach is that candidate rankers are trained and evaluated offline, i.e., before being deployed in production, thus avoiding exposing users to rankers of lesser quality compared to that currently in production. however, unlike traditional supervised ltr methods [20] , users interaction data provides only partial feedback which cannot be directly treated as absolute relevance labels [14, 16] . this is because clicks may have not been observed on some results because of position or selection bias, and clicks may have instead been observed because of noise or errors. as a result, much of the prior work has focused on removing these biases and noise. according to position bias, users are more likely to click on top-ranked search results than those at the bottom of the serp [2, 16, 18] : in cltr this probability is referred to as propensity. joachims et al. [18] developed an unbiased (with respect to position) ltr that relies on clicks using a modified svmrank approach that optimizes the empirical risk computed using the inverse propensity scoring (ips) estimator. the ips is an unbiased estimator which can indicate the effectiveness of a ranker given propensity (the probability that the user will examine a document) and click data [18] . however, this approach requires a propensity model to compute the ips score. to estimate this, randomization experiments are usually required when collecting the interaction data and the propensity model is estimated under offline setting [37, 38] . aside from position bias, selection bias is also important, and it dominates problems in other ranking tasks such as recommendation and ad placement. selection bias refers to the fact that users can only interact with items presented to them. typically, in ad placement systems, the assumption is made that users examine the displayed ads with certainty if only one item is shown: thus no position bias. however, users are given the chance to click on the displayed item only, so clicks are heavily biased due to selection. user interactions with this kind of systems are referred to as bandit feedback [17, 33, 34] . the counterfactual risk minimization (crm) learning principle [33] is used to remove the bias from bandit feedback. instead of a deterministic ranker, this group of methods assume the system relies on the hypothesis that a probability distribution is available over the candidate items, which is used to sample items to show to users. importance sampling [3] is commonly used to remove selection bias. online learning to rank aims to optimize the production ranker interactively by exploiting user clicks [10, 22, 23, 29] . 
unlike cltr, oltr algorithms do not require a propensity model to handle position or selection bias. instead, they assume that relevant documents are more likely to receive more clicks than non-relevant documents, and they exploit clicks to identify the gradient's direction. dueling bandit gradient descent (dbgd) based algorithms [39] are commonly used in oltr. the traditional dbgd uses online evaluation to unbiasedly compare two or more rankers given a user interaction [12, 29]. subsequent methods developed more reliable or more efficient online evaluation methods, including probabilistic interleaving (pigd), which has been proven to be unbiased [12]. the probabilistic multileaving extension (pmgd) [28] compares multiple rankers at each interaction, resulting in the best dbgd-based algorithm, which reaches a better convergence given fewer training impressions [23]. however, this method suffers from a high computational cost because it requires sampling ranking assignments to infer outcomes. further variations that reuse historical interaction data to accelerate the learning in dbgd have also been investigated [10]. the current state-of-the-art oltr algorithm is pairwise differentiable gradient descent (pdgd) [22], which does not require sampling candidate rankers to create interleaved result lists for online evaluation. instead, pdgd creates a probability distribution over the document set and constructs the result list by sampling documents from this distribution. the gradients are then estimated from pairwise document preferences based on user clicks. this algorithm provides much better performance than traditional dbgd-based methods in terms of final convergence and user online experience. the proposed coltr method uses counterfactual evaluation to estimate the effectiveness of candidate rankers based on the click data collected by the logging ranker. this is unlike dbgd and other oltr methods that use interleaving. in the counterfactual learning to rank setting, the ips estimator is used to eliminate position bias [18], providing an unbiased estimation. however, the ips estimator requires that the propensities of result documents are known. the propensity of a document is the probability that the user will examine the document. in offline ltr settings, propensities are estimated using offline click-through data, via a randomization experiment [38]. offline click-through data is not available in the online setting we consider, and thus the use of ips in such an online setting becomes a challenge. to overcome this, we adapt the counterfactual estimator used in batch learning from logged bandit feedback [32, 34]. this type of counterfactual learning treats rankers as policies and samples documents from a probability distribution to create the result list. this allows us to use importance sampling to fix the distribution mismatch between the policies, and a monte carlo approximation to estimate the risk function r(f θ′ ) of a new ranker (eq. 1): the reward δ of each of the k documents in the result list is weighted by the ratio between the new ranker's and the logging ranker's probabilities of sampling that document, and the weighted rewards are averaged. here, θ denotes the feature weights of the logging ranker, θ′ the new ranker's feature weights which need to be estimated, and δ the reward function, which is set (eq. 2) following the counterfactual risk minimization (crm) learning principle [33]. in counterfactual learning to rank, the user examination is modelled as propensity. in learning from logged bandit feedback, only the examined documents are considered. in the online setting, however, it is unclear how to determine which documents the user has examined (e.g.
a user may have considered a snippet, but did not click on it). we make the assumption that users always examine documents from top to bottom, and thus consider the documents ranked above the one that was clicked last as having been examined. with this in place, the reward function described in eq. 2 can be used to assign rewards to documents in the result list. unlike traditional dbgd-based oltr, which ranks documents according to the scores assigned by the ranking function (i.e., deterministically), coltr creates the result list to be provided to the user for gathering feedback by sampling documents from a known probability distribution. that is, document d i is drawn from a distribution p(d i |f θ , d) computed by the logging ranker θ. we use a softmax to convert document scores into a probability distribution (eq. 3), p(d i |f θ , d) = exp(f θ (d i )/τ ) / Σ dj∈d exp(f θ (d j )/τ ), where τ is the temperature parameter, which is commonly used in the field of reinforcement learning to control the sharpness of the probability distribution [31]. for high values of τ (τ → ∞), the distribution becomes uniform. for low values (τ → 0), the probability of the document with the highest score tends to 1. after a document has been picked, the probability distribution is renormalized to avoid sampling duplicates. this kind of probabilistic ranker has been used in previous works [4, 14, 22]. while it has been proved that the risk estimator in eq. 1 is an unbiased estimator, it does suffer from the propensity overfitting problem [34], i.e., the learning algorithm may learn a ranker that assigns small probability values over all the documents d i in the result list, as this can minimize the risk function. to address this problem, we use the self-normalized risk estimator r sn (f θ′ ) (similar to [34]; eqs. 4 and 5), obtained by normalizing r(f θ′ ) with a term s(f θ′ ) that aggregates the importance weights of the documents in the result list. intuitively, if propensity overfitting does occur, s(f θ′ ) will be small, giving a penalty to r sn (f θ′ ). following the crm principle, the aim of the learning algorithm is to find a ranker with feature weights θ′ that optimizes the self-normalized risk estimator as well as its empirical standard deviation (eq. 6). var(r sn (f θ′ )) is the empirical variance of r sn (f θ′ ), which we compute using an approximate variance estimation [27] (eq. 7), where λ = 1 controls the impact of the empirical variance. the previous section described the counterfactual evaluation that can be used in an online learning to rank setting. next, we introduce the coltr algorithm that can leverage the counterfactual evaluation to update the current ranker weights θ t . coltr uses the dbgd framework to optimize the current production ranker, but it does not rely on interleaving or multileaving comparisons; the counterfactual evaluation step is summarised in algorithm 2 (infer_winners(δ t , θ t , c, l t , d t , λ)). algorithm 1 describes the coltr updating process: similar to dbgd, it requires the initial ranker weights θ 1 , the learning rate α which is used to control the update speed, and the step size η which controls the gradient size. at each timestamp t, i.e., at each round of user interactions (line 2), the search engine receives a query q t issued by a user (line 3). then the candidate document set d t is generated given q t (line 4), and the result list l t is created by sampling documents d i without replacement from the probability distribution computed by eq. 3 (line 5). the result list is then presented to the user and clicks are observed. then the reward label vector δ t is generated according to eq. 2 (line 6).
next, an empty candidate ranker pool c is created (line 7) and candidate rankers are generated and added to the pool (lines 8-12). counterfactual evaluation is used to compute the risk associated with each ranker, as described in algorithm 2. the rankers with a risk lower than that of the logging ranker are said to win and are placed in the set w (line 13). finally, the current ranker weights are updated by adding the mean of the winners' unit vectors (line 14), modulated by the learning rate α. the method coltr uses for computing gradients is similar to that of dbgd with multileaving (pmgd) [29]. however, coltr is more efficient. in fact, it does not need to generate an interleaved or multileaved result list for exploring user preferences. when the length of the result list is large, the computational cost of multileaving becomes considerable. in addition, using online evaluation to infer outcomes is very expensive, especially for probabilistic multileaving evaluation [28]: this type of evaluation requires sampling a large number of ranking assignments to decide which rankers are the winners, a computationally expensive operation. in contrast, the time complexity of counterfactual evaluation increases linearly with the number of candidate rankers (the for loop in algorithm 2, line 3). to compute the probabilities of sampling documents for the logging and new rankers (algorithm 2, lines 6 and 7), the document scores in eq. 3 need to be renormalized after each rank, which attracts additional computational cost. for efficiency reasons, we approximate these probabilities by assuming independence, so that we can compute the probabilities only once. as a result, coltr can efficiently compare a large number of candidate rankers at each interaction.

datasets. we used four publicly available web search ltr datasets to evaluate coltr. each dataset contains query-document pair features and (graded) relevance labels. all feature values are normalised using minmax at the query level. the datasets are split into training, validation and test sets using the splits provided with the datasets. the smallest datasets in our experiments are mq2007 (1,700 queries) and mq2008 (800 queries) [25], which are a subset of letor 4.0. they rely on the gov2 collection and the query set from the trec million query track [1]. query-document pairs are represented with respect to 46 features and 3-graded relevance (from 0, not relevant, to 2, very relevant). in addition to these datasets, we use the larger mslr-web10k [25] and yahoo! learning to rank challenge datasets [5]. data for these datasets comes from commercial search engines (bing and yahoo, respectively), and relevance labels are assigned on a five-point scale (0 to 4). mslr-web10k contains 10,000 queries and 125 retrieved documents on average, which are represented with respect to 136 features, while yahoo! is the largest dataset we consider, with 29,921 queries and 709,877 documents, represented using 700 features.

simulating user behaviour. following previous oltr work [9, 11, 22, 23, 29, 41], we use the cascade click model (ccm) [6, 8] to generate user clicks. this click model assumes users examine documents in the result list from top to bottom and decide to click with a probability p(click = 1|r), where r is the relevance grade of the examined document. after a document is clicked, the user may stop examining the remainder of the list with probability p(stop = 1|r).
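the update step of algorithm 1 sketched above can be written compactly as follows; the names, shapes, and the way candidate weights are generated are assumptions made for illustration only.

```python
import numpy as np

def coltr_update(theta, candidates, risks, risk_logging, alpha=0.1):
    """One COLTR-style update: keep the candidate rankers whose counterfactual
    risk is lower than the logging ranker's, and move the current weights by
    the mean of the winners' unit direction vectors, scaled by alpha."""
    winners = [c for c, r in zip(candidates, risks) if r < risk_logging]
    if not winners:
        return theta                      # no candidate beats the logging ranker
    dirs = [(c - theta) / np.linalg.norm(c - theta) for c in winners]
    return theta + alpha * np.mean(dirs, axis=0)

def sample_candidates(theta, n=499, eta=1.0, rng=np.random.default_rng(0)):
    """Candidate weights: the current weights plus eta times a random unit vector
    (assumed to follow the unit-sphere sampling used for the DBGD baselines)."""
    noise = rng.normal(size=(n, theta.size))
    noise /= np.linalg.norm(noise, axis=1, keepdims=True)
    return [theta + eta * u for u in noise]
```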
in line with previous work, we study three different user behaviours and the corresponding click models. the perfect model simulates the user who clicks on every relevant document in the result list and never clicks on non-relevant documents. the navigational model simulates the user looking for a single highly relevant document and thus is unlikely to continue after finding the first relevant one. the informational model represents the user that searches for topical information and that exhibits a much nosier click behaviour. we use the settings used by previous work for instantiating these click behaviours, e.g., see table 1 in [23] . in our experiments, the issuing of queries is simulated by uniformly sampling baselines. three baselines are considered for comparison with coltr. the traditional dbgd with probabilistic interleaving (pigd) [12] is used as a representative oltr method -note that coltr also uses dbgd, but with counterfactual evaluation in place of the interleaving method. for pigd, only one candidate ranker is sampled at each interaction; sampling occurs by randomly varying feature weights on the unit sphere with step size η = 1, and updating the current ranker with learning rate α = 0.01. the probabilistic multileaving gradient descent method (pmgd) [23] is also used in our experiments, as it is the dbgd-based method that has been reported to achieve the highest performance so far for this class of approaches [21] . for this baseline, we use the same parameters settings reported in previous work [22] , where the number of candidates was set to n = 49, step size to η = 1 and learning rate to α = 0.01. the third baseline we consider is the pairwise differentiable gradient descent (pdgd) [22] , which is the current state-of-the-art oltr method. we set pdgd's parameters according to oosterhuis et al. [22] , and specify learning rate α = 0.1 and use zero initialization. for coltr, we use η = 1. we use a learning rate decay for α: in this case, α starts at 0.1 and decreases according to α = α * 0.99966 after each update. we set the temperature parameter τ = 0.1 when sampling documents and test different numbers of candidate rankers from n = 1 to n = 999. for all experiments, we only display k = 10 documents to the user, and all methods are used to optimize a linear ranker. note, we do not directly compared with counterfactual ltr approaches like that of joachims et al. [18] because we consider an online setup (while counterfactual ltr requires a large datasets of previous interactions, and the estimation of propensity, which is unfeasible to be performed in an online setting). evaluation measures. the effectiveness of the considered oltr methods is measured with respect to both offline and online performance. for offline performance, we average the ndcg@10 scores of the production ranker over the queries in the held-out test set. this measure indicates the effectiveness of the learned ranker. the offline performance of each method is measured for 10,000 impressions, and the final offline performance is also recorded. online performance is computed as the ndcg@10 score produced by the result list displayed to the user during the training phase [13] . this measure indicates the quality of the user experience during training. a discount factor γ is used to ensure that long-term impressions have less impact, i.e. t=1 ndcg(l t ) · γ t−1 . following previous work [21] [22] [23] , we choose γ = 0.9995 so that impressions after the horizon of 10,000 have less than a 1% impact. 
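the click simulation just described can be sketched as follows; the click and stop probability tables are placeholders, since the exact values (table 1 in [23]) are not reproduced in the text.

```python
import random

def simulate_ccm_clicks(relevance_labels, p_click, p_stop, rng=random.Random(42)):
    """Cascade click model: examine the result list top to bottom, click with
    probability p_click[r], and after a click stop with probability p_stop[r]."""
    clicks = []
    for rank, r in enumerate(relevance_labels):
        if rng.random() < p_click[r]:
            clicks.append(rank)
            if rng.random() < p_stop[r]:
                break
    return clicks

# placeholder probabilities for a 3-graded "perfect"-like user (not the values of [23])
p_click_perfect = {0: 0.0, 1: 0.5, 2: 1.0}
p_stop_perfect = {0: 0.0, 1: 0.0, 2: 0.0}
print(simulate_ccm_clicks([2, 0, 1, 2, 0], p_click_perfect, p_stop_perfect))
```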
we repeated each experiment 125 times, spread over different training folds. the evaluation results are averaged and statistically significant differences between system pairs are computed using a two-tailed t-test. we first investigate how the number of candidate rankers impacts offline performance. figure 1(a) displays the offline ndcg of coltr and the baselines under the informational click setting when a different number of candidate rankers is used by coltr (recall that pigd uses two rankers and pmgd uses 49 rankers). consider coltr with one candidate ranker in addition to the production ranker (n = 1) and pigd: both are considering a single alternative ranker to that in production. from the figure, it is clear that pigd achieves a better offline performance than coltr. however, when more candidate rankers are considered, e.g., n is increased to 49, the offline performance of coltr becomes significantly higher than that of pigd. furthermore, coltr is also better than pmgd when the same number of candidate rankers are considered. moreover, coltr allows to efficiently compare a large number of candidate rankers at each interaction (impression), and thus can test with a larger set of candidate rankers. we find that increasing the number of candidate rankers can help boosting the offline performance of coltr and achieve a higher final converge. when n = 499, coltr can reach significantly better (p < 0.01) offline performance than pdgd, the current state-of-the-art oltr method. however, beyond n = 499 there are only minor improvements in offline performance, achieved at a higher computational cost -thus, in the remaining experiments, we consider only n = 499. we also consider long-term convergence. figure 1 (b) displays the results for coltr (with n = 499) and the baselines after 100,000 impressions. because a learning rate decay is used in coltr, the learning rate becomes insignificant after 30,000 impressions. in order to prevent this to happen, we stop the learning rate decay when α < 0.01, and we leave α = 0.01 constant for the remaining impressions. the figure shows that, contrary to the results in fig. 1(a) , pmgd can reach much higher performance than pigd when enough impressions are considered -this finding is consistent with previously reported observations [22] . nevertheless, both coltr and pdgd are still significantly better than pigd and pmgd, and have similar convergence: their offline performance is less affected by the long term impressions. figure 2 displays the offline performance across datasets of varying dimensions (small: mq2007, and large: mslr-web10k) under three different click models and for 10,000 impressions. the results show that pdgd and coltr outperform pigd and pmgd for all click models. we also find that, overall, coltr and the current state-of-the-art online ltr approach, pdgd have very similar learning curves across all click models and datasets, apart for the perfect click model on the mslr-web10k dataset, for which coltr is severely outperformed by pdgd. note, the trends observed in fig. 2 found also for the majority of the remaining datasets. for space reasons, we omit these results from the paper, but we make them available as an online appendix at http://ielab. io/coltr. table 1 reports the final convergence performance for all datasets and click models (including statistical significance analysis), displaying similar trends across the considered datasets. 
along with the performance obtained by rankers once training is over, the user experience obtained during training should also be considered. table 2 reports the online performance of all methods, for all datasets and click models. the state-of-the-art pdgd has the best online performance across all conditions. coltr outperforms pigd and pmgd when considering the perfect click model. for other click models, coltr is better than pigd but it does provide less cumulative online performance than pmgd, even if it achieves a better offline performance. we posit that this is because pmgd uses a deterministic ranking function to create the result list the user observes, and via multileaving it guarantees that the interleaved result list is not worse than that of the worst candidate ranker. coltr instead uses a probabilistic ranking function, and if the document sampling distribution is too similar to a uniform distribution, the result list may incorrectly contain many non-relevant documents: this results in a bad online performance. a uniform sampling distribution is obtained because noisy clicks result in some candidate rankers randomly winning the counterfactual evaluation and thus slowing down the gradient convergence and achieving an "elastic effect", where the weight vectors go forward in one interaction, and backwards in the next. this will cause the margins between the documents' scores assigned by the ranking function to become too small and thus the softmax function will not generate a "deterministic" distribution. this also explains why the online performance is much better when clicks are perfect: the gradient directions corresponding to the winning candidates are likely similar, leading the current ranker moving fast through large gradient updates (no elastic effect). in this paper, we have presented a novel online learning to rank algorithm that combines the key aspects of counterfactual learning and oltr. our method, counterfactual online learning to rank (coltr), replaces online evaluation, which is the most computational expensive step in the traditional dbgd-style oltr methods, with counterfactual evaluation. coltr does not derive a gradient function and use it to optimise an objective, but still samples different rankers, akin to the online evaluation practice. as a result, coltr can evaluate a large number of candidate rankers at a much lower computational expense. our empirical results, based on publicly available web search ltr datasets, also show that the coltr can significantly outperform dbgd-style oltr methods across different datasets and click models for offline performance. we also find that coltr achieves the same offline performance as the state-ofthe-art oltr model, the pdgd, across all datasets under noisy click settings. this means coltr can provide a robust and effective ranker to be deployed into production, once trained online. however, due to the uniform sampling distribution employed by coltr to select among candidate documents, coltr has worse online performance than pmgd and pdgd. future work will investigate the difference between gradients provided by pdgd and coltr, as they both use a probabilistic ranker to create the result list. this analysis could provide further indications about the reasons why the online performance of coltr is limited. other improvements could be implemented for coltr. 
first, instead of stochastically learning at each interaction, historical user interaction data could be used to perform batch learning, which may provide even more reliable gradients under noisy clicks. note that this extension is possible, and methodologically simple, for coltr, but not for pdgd. second, the use of the exploration variance reduction method [35, 36] could be investigated to reduce the gradient exploration space: this may solve the uniform sampling distribution problem.

references:
- million query track
- bias on the web
- counterfactual reasoning and learning systems: the example of computational advertising
- learning to rank: from pairwise approach to listwise approach
- yahoo! learning to rank challenge overview
- click models for web search
- an eye tracking study of the effect of target rank on web search
- efficient multiple-click models in web search
- evaluation of methods for relative comparison of retrieval systems based on clickthroughs
- reusing historical interaction data for faster online learning to rank for ir
- balancing exploration and exploitation in learning to rank online
- a probabilistic method for inferring preferences from clicks
- fast and reliable online learning to rank for information retrieval
- to model or to intervene: a comparison of counterfactual and online learning to rank from user interactions
- optimizing search engines using clickthrough data
- accurately interpreting clickthrough data as implicit feedback
- deep learning with logged bandit feedback
- unbiased learning-to-rank with biased feedback
- online exploration for detecting shifts in fresh intent
- learning to rank for information retrieval
- balancing speed and quality in online learning to rank for information retrieval
- differentiable unbiased online learning to rank
- probabilistic multileave gradient descent
- in google we trust: users' decisions on rank, position, and relevance
- how does clickthrough data reflect retrieval quality
- simulation and the monte carlo method
- probabilistic multileave for online retrieval evaluation
- multileave gradient descent for fast online learning to rank
- multileaved comparisons for fast online evaluation
- reinforcement learning: an introduction
- batch learning from logged bandit feedback through counterfactual risk minimization
- counterfactual risk minimization: learning from logged bandit feedback
- the self-normalized estimator for counterfactual learning
- variance reduction in gradient exploration for online learning to rank
- efficient exploration of gradient space for online learning to rank
- learning to rank with selection bias in personal search
- position bias estimation for unbiased learning to rank in personal search
- interactively optimizing information retrieval systems as a dueling bandits problem
- beyond position bias: examining result attractiveness as a source of presentation bias in clickthrough data
- relative confidence sampling for efficient on-line ranker evaluation

acknowledgements. dr guido zuccon is the recipient of an australian research council decra research fellowship (de180101579) and a google faculty award.

key: cord-020901-aew8xr6n authors: garcía-durán, alberto; gonzález, roberto; oñoro-rubio, daniel; niepert, mathias; li, hui title: transrev: modeling reviews as translations from users to items date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_16 sha: doc_id: 20901 cord_uid: aew8xr6n the text of a review expresses the sentiment a customer has towards a particular product.
this is exploited in sentiment analysis where machine learning models are used to predict the review score from the text of the review. furthermore, the products costumers have purchased in the past are indicative of the products they will purchase in the future. this is what recommender systems exploit by learning models from purchase information to predict the items a customer might be interested in. the underlying structure of this problem setting is a bipartite graph, wherein customer nodes are connected to product nodes via ‘review’ links. this is reminiscent of knowledge bases, with ‘review’ links replacing relation types. we propose transrev, an approach to the product recommendation problem that integrates ideas from recommender systems, sentiment analysis, and multi-relational learning into a joint learning objective. transrev learns vector representations for users, items, and reviews. the embedding of a review is learned such that (a) it performs well as input feature of a regression model for sentiment prediction; and (b) it always translates the reviewer embedding to the embedding of the reviewed item. this is reminiscent of transe [5], a popular embedding method for link prediction in knowledge bases. this allows transrev to approximate a review embedding at test time as the difference of the embedding of each item and the user embedding. the approximated review embedding is then used with the regression model to predict the review score for each item. transrev outperforms state of the art recommender systems on a large number of benchmark data sets. moreover, it is able to retrieve, for each user and item, the review text from the training set whose embedding is most similar to the approximated review embedding. online retail is a growing market with sales accounting for $394.9 billion or 11.7% of total us retail sales in 2016 [35] . in the same year, e-commerce sales accounted for 41.6% of all retail sales growth [15] . for some entertainment products such as movies, books, and music, online retailers have long outperformed traditional in-store retailers. one of the driving forces of this success is the ability of online retailers to collect purchase histories of customers, online shopping behavior, and reviews of products for a very large number of users. this data is driving several machine learning applications in online retail, of which personalized recommendation is the most important one. with recommender systems online retailers can provide personalized product recommendations and anticipate purchasing behavior. in addition, the availability of product reviews allows users to make more informed purchasing choices and companies to analyze costumer sentiment towards their products. the latter was coined sentiment analysis and is concerned with machine learning approaches that map written text to scores. nevertheless, even the best sentiment analysis methods cannot help in determining which new products a costumer might be interested in. the obvious reason is that costumer reviews are not available for products they have not purchased yet. in recent years the availability of large corpora of product reviews has driven text-based research in the recommender system community (e.g. [3, 19, 21] ). some of these novel methods extend latent factor models to leverage review text by employing an explicit mapping from text to either user or item factors. 
at prediction time, these models predict product ratings based on some operation (typically the dot product) applied to the user and product representations. sentiment analysis, however, is usually applied to some representation (e.g. bag-of-words) of the review text, but in a recommender system scenario the review is not available at prediction time. with this paper we propose transrev, a method that combines a personalized recommendation learning objective with a sentiment analysis objective into a joint learning objective. transrev learns vector representations for users, items, and reviews jointly. the crucial advantage of transrev is that the review embedding is learned such that it corresponds to a translation that moves the embedding of the reviewing user to the embedding of the item the review is about. this allows transrev to approximate a review embedding at test time as the difference of the item and user embeddings, despite the absence of a review from the user for that item. the approximated review embedding is then used in the sentiment analysis model to predict the review score. moreover, the approximated review embedding can be used to retrieve reviews in the training set deemed most similar by a distance measure in the embedding space. these retrieved reviews could be used for several purposes. for instance, such reviews could be provided to users as a starting point for a review, lowering the barrier to writing reviews. we address the problem of learning prediction models for the product recommendation problem. a small example of the input data typical of such a machine learning system is depicted in fig. 1 . this is reminiscent of knowledge bases, with 'reviews' replacing relation types. two nodes in a knowledge base may be joined by a number of links, each representing one relation type from a small vocabulary. here, if two nodes are connected they are linked by one single edge type, in which case it is represented by a number of words from a (very) large vocabulary. there is a set of users u, a set of items i, and a set of reviews r. each rev (u,i) ∈ r represents a review written by user u for item i. hence, rev (u,i) = [t 1 , · · · , t n ], that is, each review is a sequence of n tokens. in the following we refer to (u, rev (u,i) , i) as a triple. each such triple is associated with the review score r (u,i) given by the user u to item i. transrev embeds all users, items and reviews into a latent space where the embedding of a user plus the embedding of the review is learned to be close to the embedding of the reviewed item. it simultaneously learns a regression model to predict the rating given a review text. this is illustrated in fig. 2 (caption of fig. 2: at training time, a function's parameters are learned to compute the review embedding from the word token embeddings, such that the embedding of the user translated by the review embedding is similar to the product embedding; at the same time, a regression model g is trained to perform well on predicting ratings). at prediction time, reviews are not available, but the modeling assumption of transrev allows it to predict the review embedding by taking the difference of the item and user embeddings. this approximation is then used as an input feature of the regression model to perform rating prediction, see fig. 3 (caption of fig. 3: at test time, the review embedding is approximated as the difference between the product and user embeddings; the approximated review embedding is used to predict the rating and to retrieve similar reviews).
transrev embeds all nodes and reviews into a latent space r k (k is a model hyperparameter). the review embeddings are computed by applying a learnable function f to the token sequence of the review. the function f can be parameterized (typically with a neural network such as a recursive or convolutional neural network), but it can also be a simple parameter-free aggregation function that computes, for instance, the element-wise average or maximum of the token embeddings. we propose and evaluate a simple instance of f where the review embedding h rev (u,i) is the average of the embeddings of the tokens occurring in the review. more formally, h rev (u,i) = h 0 + (1/n) Σ t∈rev (u,i) v t (eq. (1)), where v t is the embedding associated with token t and h 0 is a review bias which is common to all reviews and takes values in r k . the review bias is of importance since there are some reviews all of whose tokens are not in the training vocabulary. in these cases we have h rev (u,i) = h 0 . the learning of the item, review, and user embeddings is determined by two learning objectives. the first objective guides the joint learning of the parameters of the regression model and the review embeddings such that the regression model performs well at review score prediction: it minimizes, over the set s, the error between the predicted score g(h rev (u,i) ) and the observed rating r (u,i) (eq. (2)), where s is the set of training triples and their associated ratings, and g is a learnable regression function r k → r that is applied to the representation of the review h rev (u,i) . while g can be an arbitrarily complex function, the instance of g used in this work is g(h rev (u,i) ) = σ(h rev (u,i) ) w + b (u,i) (eq. (3)), where w are the learnable weights of the linear regressor, σ is the sigmoid function σ(x) = 1/(1 + e −x ), and b (u,i) is the shortcut we use to refer to the sum of the bias terms, namely the user, item and overall bias: b (u,i) = b u + b i + b 0 . later we motivate the application of the sigmoid function to the review embedding. of course, in a real-world scenario a recommender system makes rating predictions on items that users have not rated yet and, consequently, reviews are not available for those items. the application of the regression model of eq. (3) to new examples, therefore, is not possible at test time. our second learning procedure aims at overcoming this limitation by leveraging ideas from embedding-based knowledge base completion methods. we want to be able to approximate a review embedding at test time such that this review embedding can be used in conjunction with the learned regression model. hence, in addition to the learning objective (2), we introduce a second objective that forces the embedding of a review to be close to the difference between the item and user embeddings. this translation-based modeling assumption is followed in transe [5] and several other knowledge base completion methods [10, 13] . we include a second term in the objective that drives the distance between (a) the user embedding translated by the review embedding and (b) the embedding of the item to be small, i.e., it minimizes ||e u + h rev (u,i) − e i || (eq. (4)), where e u and e i are the embeddings of the user and item, respectively. in the knowledge base embedding literature (cf. [5] ) it is common that the representations are learned via a margin-based loss, where the embeddings are updated if the score (the negative distance) of a positive triple (e.g. (berlin, located_in, germany)) is not larger than the score of a negative triple (e.g. (berlin, located_in, portugal)) plus a margin. note that this type of learning is required to avoid trivial solutions.
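the training-time computations described above (the averaged review embedding of eq. (1), the regression model of eq. (3) and the translation term of eq. (4)) can be sketched as follows; the use of a squared error for the rating part and the exact way the two terms are combined are our assumptions.

```python
import numpy as np

def review_embedding(token_ids, V, h0):
    """Average of the token embeddings plus the shared review bias h0 (Eq. (1))."""
    if len(token_ids) == 0:
        return h0.copy()          # reviews with only out-of-vocabulary tokens
    return V[token_ids].mean(axis=0) + h0

def regression_g(h_rev, w, b_u, b_i, b_0):
    """g applied to a review embedding: linear regressor on sigmoid(h) plus bias terms (Eq. (3))."""
    return float(w @ (1.0 / (1.0 + np.exp(-h_rev))) + b_u + b_i + b_0)

def transrev_loss(e_u, e_i, h_rev, rating, w, b_u, b_i, b_0, lam=0.25):
    """Joint objective: rating-prediction error plus lam times the translation
    term ||e_u + h_rev - e_i|| of Eq. (4)."""
    rating_err = (regression_g(h_rev, w, b_u, b_i, b_0) - rating) ** 2
    translation_err = np.linalg.norm(e_u + h_rev - e_i)
    return rating_err + lam * translation_err
```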
the minimization problem of eq. (4) can easily be solved by setting e u = h rev (u,i) = e i = 0 ∀u, i. however, this kind of trivial solution is avoided by jointly optimizing eqs. (2) and (4), since a degenerate solution like the aforementioned one would lead to a high error with respect to the regression objective (eq. (2)). the overall objective can now be written as the minimization, over the parameters θ, of the sum of the regression loss of eq. (2) and λ times the translation loss of eq. (4) (eq. (5)), where λ is a term that weights the approximation loss due to the modeling assumption formalized in eq. (4). in our model, θ corresponds to the parameters w, e, v, h 0 ∈ r k and the bias terms b. at test time, we can now approximate review embeddings of (u, i) pairs not seen during training by computing ĥ rev (u,i) = e i − e u (eq. (6)). with the trained regression model g we can make rating predictions r̂ (u,i) for unseen (u, i) pairs by computing r̂ (u,i) = g(ĥ rev (u,i) ) (eq. (7)). contrary to training, the regression model g is now applied to ĥ rev (u,i) instead of h rev (u,i) , which is not available at test time. the sigmoid function of the regression function g adds a non-linear interaction between the user and item representation. without such an activation function, the model would consist of a linear combination of bias terms and the (ranking of the) served recommendations would be identical for all users. all parameters of the parts of the objective are jointly learned with stochastic gradient descent. more details regarding the parameter learning are contained in the experimental section. the choice of transe as the underlying modeling assumption for this recommendation problem is not arbitrary. given the user and item embeddings, and without further constraints, it allows us to uniquely compute the approximate review embedding via eq. (6). another popular knowledge graph embedding method is distmult [16] . in applying such a modeling assumption to this problem one would obtain the approximate review embedding by solving the optimization problem ĥ rev (u,i) = arg max h (e i • e u ) h, where • is the element-wise multiplication. the solution to that problem would be any vector with infinite norm. therefore, one should impose constraints on the norm of the embeddings to obtain a non-trivial solution. however, previous work [11] shows that such a constraint harms performance. similarly, most knowledge graph embedding methods would require imposing constraints on the norm of the embeddings. the translation modeling assumption of transe facilitates the approximation of the review embedding without additional constraints, while its performance is on par with, if not better than, that of most other translation-based knowledge graph embedding methods [11] . there are three lines of research related to our work: knowledge graph completion, recommender systems and sentiment analysis. the first research theme related to transrev is knowledge graph completion. in recent years, many embedding-based methods have been proposed to infer missing relations in knowledge graphs based on a function that computes a likelihood score from the embeddings of entities and relation types. due to its simplicity and good performance, there is a large body of work on translation-based scoring functions [5, 13] . [14] propose an approach to large-scale sequential sales prediction that embeds items into a transition space where user embeddings are modeled as translation vectors operating on item sequences. the associated optimization problem is formulated as a sequential bayesian ranking problem [28] . to the best of our knowledge, [14] is the first work to leverage ideas from knowledge graph completion methods for recommender systems.
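the test-time computation of eqs. (6) and (7) amounts to a couple of lines; this sketch assumes the same form of g as above.

```python
import numpy as np

def predict_rating(e_u, e_i, w, b_u, b_i, b_0):
    """Approximate the missing review embedding as e_i - e_u (Eq. (6)) and
    apply the trained regression model g to it (Eq. (7))."""
    h_hat = e_i - e_u
    return float(w @ (1.0 / (1.0 + np.exp(-h_hat))) + b_u + b_i + b_0)
```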
whereas transrev addresses the problem of rating prediction by incorporating review text, [14] addresses the different problem of sequential recommendation. therefore the experimental comparison to that work is not possible. in transrev the review embedding translates the user embedding to the product embedding. in [14] , the user embedding translates a product embedding to the embedding of the next purchased product. moreover, transrev gets rid of the margin-based loss (and consequently of the negative sampling) due to the joint optimization of eqs. (2) and (4), whereas [14] is formalized as a ranking problem in a similar way to [5] . subsequently, there has been additional work on translation-based models in recommender systems [25, 33] . however, these works cannot incorporate users' feedback other than ratings into the learning, which has been shown to boost performance [21] . there is an extensive body of work on recommender systems [1, 6, 29] . singular value decomposition (svd) [17] computes the review score prediction as the dot product between the item embeddings and the user embeddings plus some learnable bias terms. due to its simplicity and performance on numerous data sets-including winning solution to the netflix prize-it is still one of the most used methods for product recommendations. most of the previous research that explored the utility of review text for rating prediction can be classified into two categories. semi-supervised approaches. hft [21] was one of the first methods combining a supervised learning objective to predict ratings with an unsupervised learning objective (e.g. latent dirichlet allocation) for text content to regularize the parameters of the supervised model. the idea of combining two learning objectives has been explored in several additional approaches [3, 9, 19] . the methods differ in the unsupervised objectives, some of which are tailored to a specific domain. for example, jmars [9] outperforms hft on a movie recommendation data set but it is outperformed by hft on data sets similar to those used in our work [36] . supervised approaches. methods that fall into this category such as [31, 32] learn latent representations of users and items from the text content so as to perform well at rating prediction. the learning of the latent representations is done via a deep architecture. the approaches differences lie mainly in the neural architectures they employ. there is one crucial difference between the aforementioned methods and transrev. transrev predicts the review score based on an approximation of the review embedding computed at test time. moreover, since transrev is able to approximate a review embedding, we can use this embedding to retrieve reviews in the training set deemed most similar by a distance metric in the embedding space. similar to sentiment analysis methods, transrev trains a regression model that predicts the review rating from the review text. contrary to the typical setting in which sentiment analysis methods operate, however, review text is not available at prediction time in the recommender system setting. consequently, the application of sentiment analysis to recommender systems is not directly possible. in the simplest case, a sentiment analysis method is a linear regressor applied to a text embedding (eq. (3) ). we conduct several experiments to empirically compare transrev to state of the art methods for product recommendation. 
moreover, we provide some qualitative results on retrieving training reviews most similar to the approximated reviews at test time. we evaluate the various methods on data sets from the amazon product data 1 , which has been extensively used in previous works [21] [22] [23] . the data set consists of reviews and product metadata from amazon from may 1996 to july 2014. we focus on the 5-core versions (which contain at least 5 reviews for each user and item) of those data sets. there are 24 product categories from which we have randomly picked 18. as all previously mentioned works, we treat each of these resulting 18 data sets independently in our experiments. ratings in all benchmark data sets are integer values between 1 and 5. as in previous work, we randomly sample 80% of the reviews as training, 10% as validation, and 10% as test data. we remove reviews from the validation and test splits if they involve either a product or a user that is not part of the training data. we follow the same preprocessing steps for each data set. first, we lowercase the review texts and apply the regular expression "\w+" to tokenize the text data, discarding those words that appear in less than 0.1% of the reviews of the data set under consideration. for all the amazon data sets, both full reviews and short summaries (rarely having more than 30 words) are available. since classifying short documents into their sentiment is less challenging than doing the same for longer text [4] , we have used the reviews summaries for our work. we truncate these reviews to the first 200 words. for lack of space we cannot include statistics of the preprocessed data sets. we compare to the following methods: a svd matrix factorization; hft, which has not often been benchmarked in previous works; and deepconn [38] , which learns user and item representations from reviews via convolutional neural networks. we also include mpcn [34] (which stands for multi-pointer co-attention networks) in the comparison, however, as indicated in previous work [8] mpcn is a non-reproducible work 2 . therefore, we simply copy numbers from [34] , since they used the same data sets as the ones used in this work. additionally, we also include performance for transnets (t-nets) [7] , whose numbers are also copied from [34] . t-nets is similar to transrev in that it also infers review latent representations from user and item representations. different to transrev, it does not have any underlying graph-based modeling assumption among users, items and reviews. we set the dimension k of the embedding space to 16 for all methods. we evaluated the robustness of transrev to changes in sect. 4.6. alternatively, one could use off-the-shelf word embeddings (e.g. word2vec [24] or elmo [26] ), but this would require to assume the existence of a large collection of text for effectively learning good word representations in an unsupervised manner. however, such a corpus may not be available for some low-resource languages or domainspecific use cases. for transrev's parameters were randomly initialized [12] and learned with vanilla stochastic gradient descent. a single learning iteration performs sgd with all review triples in the training data and their associated ratings. for transrev we used a batch size of 64. we ran transrev for a maximum of 500 epochs and validated every 10 epochs. for svd we used the python package surprise 4 , and chose the learning rate and regularization term from the same range of values. 
parameters for hft were learned with l-bfgs, which was run for 2,500 learning iterations and validated every 50 iterations. for deepconn the original authors' code is not available and we used a third-party implementation 5 . we applied the default hyperparameters values for dropout and l2 regularization and used the same embedding dimension as for all other methods. all methods are validated according to the mean squared error (mse). the experimental results are listed in table 1 where the best performance is in bold font. transrev achieves the best performance on all data sets with the exception of the kindle store and automotive categories. surprisingly, hft is more competitive than more recent approaches that also take advantage of review text. most of these recent approaches do not include hft in their baselines. transrev is competitive with and often outperforms hft on the benchmark data sets under consideration. to quantify that the rating predictions made by hft and transrev are significantly different we have computed the dependent t-test for paired samples and for all data sets where transrev outperforms hft. the p-value is always smaller than 0.01. it is remarkable the low performance of deepconn, mpcn and t-nets in almost all datasets. this is in line with the findings reported in very recent work [8] , where authors' analysis reveals that deep recommender models are systematically outperformed by simple heuristic recommender methods. these results only confirm the existing problem reported in [8] . we randomly selected the 4 data sets baby, digital music, office and tools&home improvement from the amazon data and evaluated different values of k for user, item and word embedding sizes. we increase k from 4 to 64 and always validate all hyperparameters, including the regularization term. table 2 list the mse scores. we only observe small differences in the corresponding model's performances. this observation is in line with [21] . for most of the data sets the validated weighting term λ takes the value of either 0.1 or 0.25. this seems to indicate that the regression objective is more important than the modeling assumption in our task, as it directly relates to the goal of the task. the regularization term is of crucial importance to obtain good performance and largely varies across data sets, as their statistics also largely differ across data sets. review embeddings, which are learned from word embeddings, are learned to be good predictors of user ratings. as a consequence the learned word embeddings are correlated with the ratings. to visualize the correlation between words and ratings we proceed as follows. first, we assign a score to each word that is computed by taking the average rating of the reviews that contain the word. second, we compute a 2-dimensional representation of the words by applying t-sne [20] to the 16-dimensional word embeddings learned by transrev. figure 4 depicts these 2-dimensional word embedding vectors learned for the amazon beauty data set. the corresponding rating scores are indicated by the color. the clusters we discovered in fig. 4 are interpretable. they are meaningful with respect to the score, observing that the upper right cluster is mostly made up of words with negative connotations (e.g. horrible, useless. . . ), the lower left one contains neutral words (e.g. with, products. . . ) and the lower right one contains words with positive connotations (e.g. awesome, excellent. . . ). 
one of the characteristics of transrev is its ability to approximate the review representation at prediction time. this approximation is used to make a rating prediction, but it can also be used to propose a tentative review on which the user can elaborate. this is related to a number of approaches [18, 27, 37] on explainable recommendations. we compute the euclidean distance between the approximated review embedding ĥ rev (u,i) and all review embeddings h rev (u,i) from the training set. we then retrieve the review text with the most similar review embedding. we investigate the quality of the tentative reviews that transrev retrieves for the beauty and digital music data sets. the example reviews listed in table 3 show that while the overall sentiment is correct in most cases, we can also observe the following shortcomings: (a) the function f chosen in our work is invariant to word ordering and, therefore, cannot learn that bigrams such as "not good" have a negative meaning. (b) despite matching the overall sentiment, the actual and retrieved review can refer to different aspects of the product (for example, "it clumps" and "gives me headaches"). related work [37] extracts aspects from reviews by applying a number of grammatical and morphological analysis tools. these aspects are used later on to explain why the model suspects that a user might be interested in a certain product. we think this type of explanation is complementary to ours, and might inspire future work. (c) reviews can be specific to a single product. a straightforward improvement could consist of retrieving only existing reviews for the specific product under consideration.

table 3. reviews retrieved from the beauty (upper) and digital music (lower) data sets. in parentheses, the ratings associated with the reviews.
review | closest training review in embedding space
skin improved (5) | makes your face feel refreshed (5)
love it (5) | you'll notice the difference (5)
best soap ever (5) | i'll never change it (5)
it clumps (2) | gives me headaches (1)
smells like bug repellent (3) | pantene give it up (2)
fake fake fake do not buy (1) | seems to be harsh on my skin (2)
saved my skin (5) | not good quality (2)
another great release from saliva (5) | can't say enough good things about this cd (5)
a great collection (5) | definitive collection (5)
sound nice (3) | not his best nor his worst (4)
a complete massacre of an album (2) | some great songs but overall a disappointment (3)
the very worst best of ever (1) | overall a pretty big disappointment (2)
what a boring moment (1) | overrated but still alright (3)
great cd (5) | a brilliant van halen debut album (5)

we believe that more sophisticated sentence and paragraph representations might lead to better results in the review retrieval task. as discussed, a promising line of research has to do with learning representations for reviews that are aspect-specific (e.g. "ease of use" or "price"). transrev is a novel approach for product recommendation combining ideas from knowledge graph embedding methods, recommender systems and sentiment analysis. transrev achieves state-of-the-art performance on the data sets under consideration while having fewer (hyper)parameters than more recent works. most importantly, one main characteristic of transrev is its ability to approximate the review representation during inference. this approximated representation can be used to retrieve reviews in the training set that are similar with respect to the overall sentiment towards the product.
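the retrieval mechanism described above is a simple nearest-neighbour lookup; a minimal sketch, assuming the training review embeddings are stacked in a matrix:

```python
import numpy as np

def closest_training_review(h_hat, train_embeddings, train_texts):
    """Return the text of the training review whose embedding has the smallest
    Euclidean distance to the approximated review embedding h_hat."""
    dists = np.linalg.norm(train_embeddings - h_hat, axis=1)
    return train_texts[int(np.argmin(dists))]
```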
such reviews can be dispatched to users as a starting point for a review, thus lowering the barrier to writing new reviews. given the known influence of product reviews on the purchasing choices of users [2, 30] , we think that recommender systems will benefit from such a mechanism.

references:
- user models: theory, method, and practice
- topicmf: simultaneously exploiting ratings and reviews for recommendation
- classifying sentiment in microblogs: is brevity an advantage? in: cikm
- translating embeddings for modeling multi-relational data
- empirical analysis of predictive algorithms for collaborative filtering
- transnets: learning to transform for recommendation
- are we really making much progress? a worrying analysis of recent neural recommendation approaches
- jointly modeling aspects, ratings and sentiments for movie recommendation (jmars)
- composing relationships with translations
- combining two and three-way embedding models for link prediction in knowledge bases
- understanding the difficulty of training deep feedforward neural networks
- traversing knowledge graphs in vector space
- translation-based recommendation
- knowledge base completion: baselines strike back
- matrix factorization techniques for recommender systems
- research and development in intelligent systems xxxii
- ratings meet reviews, a combined approach to recommend
- visualizing data using t-sne
- hidden factors and hidden topics: understanding rating dimensions with review text
- inferring networks of substitutable and complementary products
- image-based recommendations on styles and substitutes
- distributed representations of words and phrases and their compositionality
- an empirical comparison of knowledge graph embeddings for item recommendation
- deep contextualized word representations
- lit@eve: explainable recommendation based on wikipedia concept vectors
- factorizing personalized markov chains for next-basket recommendation
- fast maximum margin matrix factorization for collaborative prediction
- interpretable convolutional neural networks with dual local and global attention for review rating prediction
- representation learning of users and items for review rating prediction using attention-based convolutional neural network
- latent relational metric learning via memory-based attention for collaborative ranking
- multi-pointer co-attention networks for recommendation
- explaining reviews and ratings with paco: poisson additive co-clustering
- explicit factor models for explainable recommendation based on phrase-level sentiment analysis
- joint deep modeling of users and items using reviews for recommendation

acknowledgements. the research leading to these results has received funding from the european union's horizon 2020 innovation action programme under grant agreement no 786741 - smooth project. this publication reflects only the author's views and the european community is not liable for any use that may be made of the information contained herein.

key: cord-020909-n36p5n2k authors: papadakos, panagiotis; konstantakis, giannis title: bias goggles: graph-based computation of the bias of web domains through the eyes of users date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_52 sha: doc_id: 20909 cord_uid: n36p5n2k ethical issues, along with transparency, disinformation, and bias, are in the focus of our information society. in this work, we propose the bias goggles model, for computing the bias characteristics of web domains to user-defined concepts based on the structure of the web graph.
for supporting the model, we exploit well-known propagation models and the newly introduced biased-pr pagerank algorithm, that models various behaviours of biased surfers. an implementation discussion, along with a preliminary evaluation over a subset of the greek web graph, shows the applicability of the model even in real-time for small graphs, and showcases rather promising and interesting results. finally, we pinpoint important directions for future work. a constantly evolving prototype of the bias goggles system is readily available. there is an increasing concern about the potential risks in the consumption of abundant biased information in online platforms like web search engines (wses) and social networks. terms like echo chambers and filter-bubbles [26] depict the isolation of groups of people and its aftereffects, that result from the selective and restrictive exposure to information. this restriction can be the result of helpful personalized algorithms, that suggest user connections or rank highly information relevant to the users' profile. yet, this isolation might inhibit the growth of informed and responsible humans/citizens/consumers, and can also be the result of malicious algorithms that promote and resurrect social, religious, ethnic, and other kinds of discriminations and stereotypes. currently, the community focus is towards the transparency, fairness, and accountability of mostly machine learning algorithms for decision-making, classification, and recommendation in social platforms like twitter. however, social platforms and wses mainly act as gateways to information published on the web as common web pages (e.g., blogs and news). unfortunately, users are unaware of the bias characteristics of these pages, except for obvious facts (e.g., a page in a political party's web site will be biased towards this party). in this work, we propose the bias goggles model, where users are able to explore the biased characteristics of web domains for a specific biased concept (i.e., a bias goggle). since there is no objective definition of what bias and biased concepts are [27] , we let users define them. for these concepts, the model computes the support and the bias score of a web domain, by considering the support of this domain for each aspect (i.e., dimension) of the biased concept. these support scores are calculated by graph-based algorithms that exploit the structure of the web graph and a set of user-defined seeds representing each aspect of bias. as a running example we will use the biased concept of greek politics, that consists of nine aspects of bias, each one representing a popular greek party, and identified by a single seed; the domain of its homepage. 
in a nutshell, the main contributions of this work are: -the bias goggles model for computing the bias characteristics of web domains for a user-defined concept, based on the notions of biased concepts (bcs), aspects of bias (abs), and the metrics of the support of the domain for a specific ab and bc, and its bias score for this bc, -the introduction of the support flow graph (sfg), along with graph-based algorithms for computing the ab support score of domains, that include adaptations of the independence cascade (ic) and linear threshold (lt) propagation models, and the new biased-pagerank (biased-pr) variation that models different behaviours of a biased surfer, -an initial discussion about performance and implementation issues, -some promising evaluation results that showcase the effectiveness and efficiency of the approach on a relatively small dataset of crawled pages, using the new agbr and ags metrics, -a publicly accessible prototype of bias goggles. the rest of the paper is organized as follows: the background and the related work is discussed in sect. 2, while the proposed model, and its notions and metrics are described in sect. 3. the graph-based algorithms for computing the support score of a domain for a specific ab are introduced in sect. 4. the developed prototype and related performance issues are discussed in sect. 5, while some preliminary evaluation results over a relatively small dataset of web pages are reported in sect. 6. finally, sect. 7 concludes the paper and outlines future work. social platforms have been found to strengthen users' existing biases [21] since most users try to access information that they agree with [18] . this behaviour leads to rating bubbles when positive social influence accumulates [24] and minimizes the exposure to different opinions [31] . this is also evident in wses, where the personalization and filtering algorithms lead to echo chambers and filter bubbles that reinforce bias [4, 12] . remarkably, users of search engines trust more the top-ranked search results [25] and biased search algorithms can shift the voting preferences of undecided voters by as much as 20% [8] . there is an increasingly growing number of discrimination reports regarding various protected attributes (e.g., race, gender, etc.) in various domains, like in ads [7, 29] and recommendation systems [13] , leading to efforts for defining principles of accountable 1 , auditing [28] and de-bias algorithms [1] , along with fair classifiers [6, 14, 34] . tools that remove discriminating information 2 , flag fake news 3 , make personalization algorithms more transparent 4 , or show political biases in social networks 5 also exist. finally, a call for equal opportunities by design [16] has been raised regarding the risks of bias in the stages of the design, implementation, training and deployment of data-driven decision-making algorithms [3, 11, 20] . there are various efforts for measuring bias in online platforms [27] . bias in wses has been measured as the deviation from the distribution of the results of a pool of search engines [23] and the coverage of srps towards us sites [30] . furthermore, the presence of bias in media sources has been explored through human annotations [5] , by exploiting affiliations [32] , the impartiality of messages [33] , the content and linked-based tracking of topic bias [22] , and the quantification of data and algorithmic bias [19] . 
however, this is the first work that provides a model that allows users to explore the available web sources based on their own definitions of biased concepts. the approach exploits the web graph structure and can annotate web sources with bias metrics on any online platform. below we describe the notions of biased concepts (bcs) and aspects of bias (abs), along with the support of a domain for an ab and a bc, and its bias score for a bc. table 1 describes the used notation: the first part covers the notation used for the web graph, while the second covers the notation for the proposed model (for instance, 1 |a| denotes an |a|-dimensional vector with support 1 in all dimensions). the interaction with a user begins with the definition of a biased concept (bc), which is considered the goggles through which the user wants to explore the web domains. bcs are given by users and correspond to a concept that can range from a very abstract one (e.g., god) to a very specific one (e.g., political parties). for each bc, it is required that the users can identify at least two aspects of bias (abs), representing its bias dimensions. abs are given by the users and correspond to a non-empty set of seeds (i.e., domains) s that the user considers to fully support this bias aspect. for example, consider the homepage of a greek political party as an aspect of bias in the biased concept of the politics in greece. notice that an ab can be part of more than one bc. typically, an ab is denoted by ab sign(s) , where sign(s) is the signature of the non-empty set of seeds s. the sign(s) is the sha1 hash of the lexicographic concatenation of the normalized second-level domains (slds) of the urls in s. we assume that all seeds in s are incomparable and support this ab with the same strength: the domains in the set of seeds s are incomparable and equally supportive of the ab sign(s) . the user-defined bc over the set of abs a ⊆ a u , where |a| ≥ 2 and a u is the universe of all possible abs in the set of domains doms(w) of the crawled pages w, is denoted by bc a and is represented by the pair < d a , desc a >. d a is an |a|-dimensional vector with |a| ≥ 2, holding all ab sign(s) ∈ a of this bc in lexicographic order. desc a is a user-defined textual description of this bc. in this work, we assume that all abs of any bc are orthogonal and unrelated: abs in a user-defined bc are considered orthogonal. using this notation, our running example is denoted as bc r = < d r , desc r >, where d r is a vector that holds lexicographically the sha1 signatures of the nine abs whose singleton seeds are the homepage domains of the greek political parties r = { {"anexartitoiellines.gr"}, {"antidiaploki.gr"}, {"elliniki-lisi.gr"}, {"kke.gr"}, {"mera25.gr"}, {"nd.gr"}, {"syriza.gr"}, {"topotami.gr"}, {"xryshaygh.com"}}, and desc r = "politics in greece" is its description. a core metric in the proposed model is the support score of a domain dom for an aspect of bias ab sign(s) , denoted as sup(ab sign(s) , dom). the support score ranges in [0, 1], where 0 denotes an unsupportive domain for the corresponding ab, and 1 a fully supportive one. we can identify three approaches for computing this support for a dataset of web pages: (a) the graph-based ones that exploit the web graph structure and the relationship of a domain with the domains in seeds(ab sign(s) ), (b) the content-based ones that consider the textual information of the respective web pages, and (c) the hybrid ones that take advantage of both the graph and the content information.
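as a concrete illustration of the ab signature defined above, the sketch below computes sign(s) for a seed set; the separator used in the concatenation is our assumption, since the text only specifies a lexicographic concatenation of the normalized slds.

```python
import hashlib

def ab_signature(seed_domains):
    """SHA1 signature of an aspect of bias: hash of the lexicographic
    concatenation of the normalized second-level domains of its seeds."""
    slds = sorted(d.strip().lower() for d in seed_domains)
    return hashlib.sha1("|".join(slds).encode("utf-8")).hexdigest()

# e.g. the singleton seed of one aspect of the running example
print(ab_signature({"kke.gr"}))
```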
in this work, we focus only on graph-based approaches and study two frequently used propagation models, the independence cascade (ic) and linear threshold (lt) models, along with the newly introduced biased-pagerank (biased-pr), which models various behaviours of biased surfers. the details about these algorithms are given in sect. 4. in the same spirit, we are interested in the support of a specific domain dom for a biased concept bc a , denoted by sup(bc a , dom). the basic intuition is that we need a metric that shows the relatedness and support to all or any of the aspects in a, which can be interpreted as the relevance of this domain to any of the aspects of the biased concept bc a . a straightforward way to measure it is the norm of the s dom vector that holds the support scores of dom for each ab in a, normalized by the norm of the 1 |a| vector, which holds the support scores of a 'virtual' domain that fully supports all bias aspects in bc a . specifically, sup(bc a , dom) = ||s dom || / ||1 |a| || ∈ [0, 1]. by using the above formula, two domains might have similar support scores for a specific bc, while the support scores for the respective aspects might differ greatly. for example, consider two domains dom and dom', with dom fully supporting only one aspect in a and dom' fully supporting another aspect in a. then sup(bc a , dom) ∼ sup(bc a , dom'). below we introduce the bias score of a domain regarding a specific bc, as a way to capture the leaning of a domain to specific abs of a bc. the bias score of a domain regarding a bc tries to capture how biased the domain is over any of its abs, and results from the support scores that the domain has for each aspect of the bc. for example, consider a domain dom that has a rather high support for a specific ab, but rather weak ones for the rest of the abs of a specific bc. this domain is expected to have a high bias score. on the other hand, a domain dom' that has similar support for all the available abs of a bc can be considered to be unbiased regarding this specific bc. we define the bias score of a domain dom for bc a based on the distance of the s dom vector from the 1 |a| vector, and we use the cosine similarity to define this distance metric. in this section, we discuss the graph-based algorithms that we use for computing the support score of a domain regarding a specific ab. we focus on the popular independence cascade (ic) and linear threshold (lt) propagation models, along with the newly introduced biased-pagerank (biased-pr) algorithm. let w be the set of crawled web pages, doms(w) the set of normalized slds in w, links(w) the set of crawled links between the domains in doms(w), and g(w) the corresponding graph with doms(w) as nodes and links(w) as edges. with link dom,dom' we denote a link from domain dom to dom', where dom, dom' ∈ doms(w), while inv(link dom,dom') inverses the direction of a link and inv(links(w)) is the set of inverse links in w. furthermore, for the links we assume that every link from a page of dom to a page of dom' is of supportive nature and of the same importance. although the above assumption might not be precise, since links from one web page to another are not always of supportive nature (e.g., a web page criticizing another linked one), or of the same importance (e.g., links in the homepage versus links deeply nested in a site), it suffices for the purposes of this first study of the model. identification of the nature of links and of the importance of the pages they appear in is left as future work. given that the assumption holds, part or whole of the support of dom' regarding any ab can flow to dom through inv(link dom,dom'). 
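a small numerical sketch of these two scores follows; it takes the bc support as the norm ratio given above and the bias score as the cosine distance of the support vector from the all-ones vector, which is one concrete reading of the definition and not necessarily the exact final formula of the model.

```python
# hypothetical helpers for sup(bc, dom) and the bias score, assuming a support
# vector s_dom with one entry per ab of the bc.
import numpy as np

def bc_support(s_dom: np.ndarray) -> float:
    ones = np.ones_like(s_dom)
    return float(np.linalg.norm(s_dom) / np.linalg.norm(ones))

def bc_bias(s_dom: np.ndarray) -> float:
    if not s_dom.any():
        return 0.0  # an entirely unsupportive domain is treated as unbiased here
    ones = np.ones_like(s_dom)
    cos = float(s_dom @ ones / (np.linalg.norm(s_dom) * np.linalg.norm(ones)))
    return 1.0 - cos  # 0 when all abs are supported equally

one_sided = np.array([1.0, 0.0, 0.0])   # fully supports a single ab
balanced = np.array([0.6, 0.6, 0.6])    # similar support for all abs
print(bc_support(one_sided), bc_bias(one_sided))
print(bc_support(balanced), bc_bias(balanced))
```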
specifically, we define the support flow graph as: support flow graph (sfg) definition. the sfg of a set of web pages w is the weighted graph that is created by inversing the links in g(w) (i.e., the graph with doms(w) as nodes and inv(links(w)) as edges). the weight of each edge is w dom,dom' = outinvlinks(dom, dom') / outinvlinks(dom), i.e., the fraction of the inverse links of dom that point to dom'. so, given an sfg(w) and the seeds(ab sign(s)) of an ab, we can now describe how the support flows in the nodes of the sfg(w) graph. all algorithms described below return a map m holding sup(ab sign(s), dom) ∀ dom ∈ doms(w). the ic propagation model was introduced by kempe et al. [17], and a number of variations have been proposed in the bibliography. below, we describe the basic form of the model as adapted to our needs. in the ic propagation model, we run n experiments. each run starts with a set of activated nodes, in our case the seeds(ab sign(s)), that fully support the ab sign(s). in each iteration there is a history-independent and non-symmetric probability of activating the neighbors of the activated nodes, associated with each edge, flowing the support to the neighbors of the activated nodes in the sfg(w). this probability is represented by the weights of the links of an activated node to its neighbors, and each node, once activated, can then activate its neighbors. the nodes and their neighbors are selected in arbitrary order. each experiment stops when there are no new activated nodes. after n runs we compute the average support score of the nodes, i.e., sup(ab sign(s), dom) ∀ dom ∈ doms(w). the algorithm is given in algorithm 1. the lt model is another widely used propagation model. the basic difference from the ic model is that for a node to become active we have to consider the support of all its neighbors, which must be greater than a threshold θ ∈ [0, 1], serving as the resistance of a node to its neighbors' joint support. again, we use the support probabilities represented by the weights of the sfg links. the full algorithm, which is based on the static model introduced by goyal et al. [10], is given in algorithm 2. in each experiment the thresholds θ get a random value. we introduce the biased-pr variation of pagerank [9] that models a biased surfer. the biased surfer always starts from the biased domains (i.e., the seeds of an ab), and either visits a domain linked by the selected seeds or one of the biased domains again, with some probability that depends on the modeled behaviour. the same process is followed in the next iterations. the biased-pr differs from the original pagerank in two ways. the first one is how the score (support in our case) of the seeds is computed at any step. the support of all domains is initially 0, except for the support of the seeds, which have the value init seeds = 1. at any step, the support of each seed is the original pagerank value, increased by a number that depends on the behaviour of the biased surfer. we have considered three behaviours: (a) the strongly supportive (ss) one, where the support is increased by init seeds and models a constantly strongly biased surfer, (b) the decreasingly supportive (ds) one, where the support is increased by init seeds /iter, modeling a surfer that becomes less biased the more pages he/she visits, and (c) the non-supportive (ns) one, with no increment, modeling a surfer that is biased only on the initially visited pages, while afterwards the support score is computed as in the original pagerank. 
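to make the propagation concrete, a minimal sketch of the adapted ic run over an sfg is given first; it assumes the sfg is a dict mapping each domain to (neighbour, weight) edges and that an activated node contributes support 1 to a run, with the average over n experiments giving the support scores (algorithm 1 itself is not reproduced).

```python
# a sketch of the adapted independence cascade over an sfg; the graph layout
# and the toy example are illustrative only.
import random
from collections import defaultdict

def ic_support(sfg, seeds, n=100, seed=0):
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(n):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for dom in frontier:
                for neigh, w in sfg.get(dom, []):
                    if neigh not in active and rng.random() < w:
                        active.add(neigh)
                        nxt.append(neigh)
            frontier = nxt
        for dom in active:
            totals[dom] += 1.0
    return {dom: s / n for dom, s in totals.items()}

sfg = {"seed.gr": [("blog.gr", 0.5)], "blog.gr": [("news.gr", 0.3)]}
print(ic_support(sfg, ["seed.gr"]))
```

a simplified sketch of the biased-pr iteration for the three behaviours follows; teleportation is uniform here instead of the distance-weighted scheme described in the next paragraph, so this is an approximation of algorithm 3 rather than a faithful reimplementation.

```python
# hypothetical biased-pr iteration; d and theta_conv follow the values used in
# the experiments (0.85 and 0.001), the rest is a simplification.
import numpy as np

def biased_pr(nodes, edges, seeds, behaviour="ss", d=0.85, theta_conv=1e-3, max_iter=1000):
    idx = {n: i for i, n in enumerate(nodes)}
    seed_idx = [idx[s] for s in seeds]
    W = np.zeros((len(nodes), len(nodes)))
    for (src, dst), w in edges.items():       # sfg edge src -> dst with weight w
        W[idx[dst], idx[src]] = w
    support = np.zeros(len(nodes))
    support[seed_idx] = 1.0                   # init_seeds = 1
    for it in range(1, max_iter + 1):
        new = (1 - d) / len(nodes) + d * (W @ support)
        if behaviour == "ss":                 # constantly strongly biased surfer
            new[seed_idx] += 1.0
        elif behaviour == "ds":               # decreasingly biased surfer
            new[seed_idx] += 1.0 / it
        # "ns": no increment after the initialisation
        if np.abs(new - support).sum() < theta_conv:
            return dict(zip(nodes, new)), it
        support = new
    return dict(zip(nodes, support)), max_iter
```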
biased-pr also differs in how the biased surfer is teleported to another domain when he/she reaches a sink (i.e., a domain that has no outgoing links). the surfer randomly teleports to a domain at any distance from the seeds, assigning the same probability to each distance. if a path from a node to any of the seeds does not exist, the distance of the node is the maximum distance of a connected node increased by one. since the number of nodes at a certain distance from the seeds increases as we move away from the seeds, the teleporting probability for an individual node is greater the closer the node is to the seeds. we expect slower convergence for biased-pr than the original pagerank, due to the initial zero scores of the non-seed nodes. the algorithm is given in algorithm 3. due to size restrictions we provide a rather limited discussion about the complexities and the cost of tuning the parameters of each algorithm. the huge scale of the web graph has the biggest performance implication for the graph-based computation of the ab support scores. what is encouraging, though, is that the algorithms are applied over the compact sfg graph, which contains the slds of the pages and their corresponding links. the complexity of ic is in o(n · |doms(w)| · |links(w)|), where n is the number of experiments. lt is much slower though, since we have to additionally consider the joint support of the neighbors of a node. finally, the biased-pr converges slower than the original pagerank, since the algorithm begins only with the seeds, spreading the support to the rest of the nodes. also, we must consider the added cost of computing the shortest paths of the nodes from the seeds. for the relatively small sfg used in our study (see sect. 6), the ss variation converges much faster than the ds and ns ones, which need ten times more iterations. for newly introduced abs though, the computation of the support scores of the domains can be considered an offline process. users can submit abs and bcs into the bias goggles system and get notified when they are ready for use. however, what is important is to let users explore in real time the domain space for any precomputed and commonly used bcs. this can be easily supported by providing efficient ways to store and retrieve the signatures of already known bcs, along with the computed support scores of the domains for the available abs. inverted files and trie-based data structures (e.g., the space-efficient burst-tries [15] and the cache-conscious hybrid or pure hat-tries [2]) over the slds and the signatures of the abs and bcs can allow the fast retrieval of offsets in files where the support scores and the related metadata are stored. given the above, the computation of the bias score and the support of a bc for a domain is extremely fast. we have implemented a prototype that allows the exploration of predefined bcs over a set of mainly greek domains. the prototype offers a rest api for retrieving the bias scores of the domains, and exploits the open-source project crawler4j. we plan to improve the prototype by allowing users to search and ingest bcs, abs and domains of interest, and to develop a user-friendly browser plugin on top of it. evaluating such a system is a rather difficult task, since there are no formal definitions of what bias in the web is, and there are no available datasets for evaluation. as a result, we based our evaluation on bcs for which it is easy to find biased sites. we used two bcs for our experiments, the greek politics (bc1) with 9 abs, and the greek football (bc2) with 6 abs. 
for these bcs, we gathered well-known domains that are generally considered fully supportive of only one of the abs, without though inspecting their link coverage to the respective seeds, to avoid any bias towards our graph-based approach. furthermore, we did not include the original seeds in this collection. in total, we collected 50 domains for bc1 and 65 domains for bc2, including newspapers, radio and television channels, blogs, pages of politicians, etc. this collection of domains is our gold standard. we crawled a subset of the greek web by running four instances of the crawler: one with 383 sites related to the greek political life, one with 89 sport-related greek sites, one with the top-300 popular greek sites according to alexa, and a final one containing 127 seeds related to big greek industries. we also black-listed a number of domains from the crawl. below we report the results of our experiments over an i7-5820k 3.3 ghz system, with 6 cores, 15 mb cache and 16 gb of ram, and a 6 tb disk. for each of the two bcs and for each algorithm, we run experiments for various iterations n and biased-pr variations, for the singleton abs of the 9 political parties and 6 sports teams. for biased-pr we evaluate all possible behaviours of the surfer using the parameters θ conv = 0.001 and d = 0.85. we also provide the average number of iterations for convergence over all abs for biased-pr. we report the run times in seconds, along with the metrics average golden bias ratio (agbr) and average golden similarity (ags), which we introduce in this work. the agbr is the ratio of the average bias score of the golden domains, as computed by the algorithms for a specific bc, to the average bias score of all domains for this bc. the higher the value, the more easily we can discriminate the golden domains from the rest. on the other hand, the ags is the average similarity of the golden domains to their corresponding abs. the higher the similarity value, the more biased the golden domains are found to be by our algorithms towards their aspects. a high similarity score though does not imply high support for the golden domains or high dissimilarity for the rest. the perfect algorithm will have high values for all metrics. the results are shown in table 2. the difference in the bc1 and bc2 results implies a less connected graph for bc2 (higher agbr values for bc2), where the support flows to fewer domains, but with a greater interaction between domains supporting different aspects (smaller ags values). what is remarkable is the striking time performance of ic, suggesting that it can be used in real time and with excellent results (at least for agbr). on the other hand, the lt is a poor choice, being the slowest of all and dominated in every aspect by ic. regarding the biased-pr, only the ss variation offers exceptional performance, especially for ags. the ds and ns variations are more expensive and have the worst results regarding agbr, especially the ns variation that avoids bias. in most cases, the algorithms benefit from more iterations. the ss variation of biased-pr needs only 40 iterations for bc1 and 31 for bc2 to converge, proving that fewer nodes are affected by the seeds in bc2. generally, the ic and the ss variation of biased-pr are the best options, with ic allowing the real-time ingestion of abs. however, we still need to evaluate the algorithms on larger graphs and for more bcs. we also manually inspected the top domains according to the bias and support scores for each algorithm and each bc. 
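as a concrete illustration of the two metrics introduced above, the sketch below computes agbr and ags from per-domain scores; it assumes that the "similarity of a golden domain to its corresponding ab" is simply its computed support score for the single ab it is known to support.

```python
# hypothetical agbr / ags helpers over the gold standard of domains.
def agbr(bias_scores, golden_domains):
    golden_avg = sum(bias_scores[d] for d in golden_domains) / len(golden_domains)
    overall_avg = sum(bias_scores.values()) / len(bias_scores)
    return golden_avg / overall_avg

def ags(support, golden_to_ab):
    # support[(ab, dom)] -> sup(ab, dom); golden_to_ab maps each golden domain
    # to the ab it is considered fully supportive of
    return sum(support[(ab, d)] for d, ab in golden_to_ab.items()) / len(golden_to_ab)

bias_scores = {"blog.gr": 0.7, "news.gr": 0.2, "party.gr": 0.9}
print(agbr(bias_scores, ["blog.gr", "party.gr"]))
```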
generally, the support scores of the domains were rather low, showcasing the value of other support cues, like the content and the importance of the pages that links appear in. in the case of bc1, apart from the political parties, we found various blogs, politicians' homepages, news sites, and also the national greek tv channel, being biased towards a specific political party. in the case of bc2 we found the sport teams, sport-related blogs, news sites, and a political party being highly biased towards a specific team, which is an interesting observation. in both cases we also found various domains with high support for all abs, suggesting that these domains are good unbiased candidates. currently, the bias goggles system is not able to pinpoint false positives (i.e., pages with non-supportive links) and false negatives (i.e., pages with content that supports a seed without linking to it), since there is no content analysis. we are certain that such results can exist, although we were not able to find such an example in the top results of our study. furthermore, we are not able to distinguish links that can frequently appear in users' content, like in the signatures of forum members. in this work, we introduce the bias goggles model that facilitates the important task of exploring the bias characteristics of web domains with respect to user-defined biased concepts. we focus only on graph-based approaches, using popular propagation models and the new biased-pr pagerank variation that models biased surfers' behaviours. we propose ways for the fast retrieval and ingestion of aspects of bias, and offer access to a developed prototype. the results show the efficiency of the approach, even in real time. a preliminary evaluation over a subset of the greek web and a manually constructed gold standard of biased concepts and domains shows promising results and interesting insights that need further research. in the future, we plan to explore variations of the proposed approach where our assumptions do not hold. for example, we plan to exploit the supportive, neutral or opposing nature of the available links, as identified by sentiment analysis methods, along with the importance of the web pages they appear in. content-based and hybrid approaches for computing the support scores of domains are also in our focus, as well as the exploitation of other available graphs, like the graph of friends, retweets, etc. in addition, interesting aspects include how the support and bias scores of multiple bcs can be composed, providing interesting insights about possible correlations of different bcs, as well as how the bias scores of domains change over time. finally, our vision is to integrate the approach in a large-scale wse/social platform/browser, in order to study how users define bias, create a globally accepted gold standard of bcs, and explore how such tools can affect the consumption of biased information. in this way, we will be able to evaluate and tune our approach in real-life scenarios, and mitigate any performance issues. de-biasing user preference ratings in recommender systems hat-trie: a cache-conscious trie-based data structure for strings man is to computer programmer as woman is to homemaker? debiasing word embeddings bias in algorithmic filtering and personalization fair and balanced? 
quantifying media bias through crowdsourced content analysis algorithmic decision making and the cost of fairness fairness through awareness the search engine manipulation effect (seme) and its possible impact on the outcomes of elections pagerank beyond the web learning influence probabilities in social networks algorithmic bias: from discrimination discovery to fairness-aware data mining measuring personalization of web search measuring price discrimination and steering on e-commerce web sites equality of opportunity in supervised learning burst tries: a fast, efficient data structure for string keys big data: a report on algorithmic systems, opportunity, and civil rights. executive office of the president maximizing the spread of influence through a social network events and controversies: influences of a shocking news event on information seeking quantifying search bias: investigating sources of bias for political searches in social media the tyranny of data? the bright and dark sides of data-driven decision-making for social good is twitter a public sphere for online conflicts? a cross-ideological and cross-hierarchical look biaswatch: a lightweight system for discovering and tracking topic-sensitive opinion bias in social media measuring search engine bias social influence bias: a randomized experiment in google we trust: users' decisions on rank, position, and relevance the filter bubble: what the internet is hiding from you on measuring bias in online information auditing algorithms: research methods for detecting discrimination on internet platforms. data and discrimination: converting critical concerns into productive inquiry risk, race, and recidivism: predictive bias and disparate impact search engine coverage bias: evidence and possible causes secular vs. islamist polarization in egypt on twitter quantifying political leaning from tweets and retweets message impartiality in social media discussions fairness beyond disparate treatment & disparate impact: learning classification without disparate mistreatment key: cord-020885-f667icyt authors: sharma, ujjwal; rudinac, stevan; worring, marcel; demmers, joris; van dolen, willemijn title: semantic path-based learning for review volume prediction date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_54 sha: doc_id: 20885 cord_uid: f667icyt graphs offer a natural abstraction for modeling complex real-world systems where entities are represented as nodes and edges encode relations between them. in such networks, entities may share common or similar attributes and may be connected by paths through multiple attribute modalities. in this work, we present an approach that uses semantically meaningful, bimodal random walks on real-world heterogeneous networks to extract correlations between nodes and bring together nodes with shared or similar attributes. an attention-based mechanism is used to combine multiple attribute-specific representations in a late fusion setup. we focus on a real-world network formed by restaurants and their shared attributes and evaluate performance on predicting the number of reviews a restaurant receives, a strong proxy for popularity. our results demonstrate the rich expressiveness of such representations in predicting review volume and the ability of an attention-based model to selectively combine individual representations for maximum predictive power on the chosen downstream task. 
multimodal graphs have been extensively used in modeling real-world networks where entities interact and communicate with each other through multiple information pathways or modalities [1, 23, 31] . each modality encodes a distinct view of the relation between nodes. for example, within a social network, users can be connected by their shared preference for a similar product or by their presence in the same geographic locale. each of these semantic contexts links the same user set with a distinct edge set. such networks have been extensively used for applications like semantic proximity search in existing interaction networks [7] , augmenting semantic relations between entities [36] , learning interactions in an unsupervised fashion [3] and augmenting traditional matrix factorization-based collaborative filtering models for recommendation [27] . each modality within a multimodal network encodes a different semantic relation and exhibits a distinct view of the network. while such views contain relations between nodes based on interactions within a single modality, observed outcomes in the real-world are often a complex combination of these interactions. therefore, it is essential to compose these complementary interactions meaningfully to build a better representation of the real world. in this work, we examine a multimodal approach that attempts to model the review-generation process as the end-product of complex interactions within a restaurant network. restaurants share a host of attributes with each other, each of which may be treated as a modality. for example, they may share the same neighborhood, the same operating hours, similar kind of cuisine, or the same 'look and feel'. furthermore, each of these attributes only uncovers a specific type of relation. for example, a view that only uses the location-modality will contain venues only connected by their colocation in a common geographical unit and will prioritize physical proximity over any other attribute. broadly, each of these views is characterized by a semantic context and encodes modality-specific relations between restaurants. these views, although informative, are complementary and only record associations within the same modality. while each of these views encodes a part of the interactions within the network, performance on a downstream task relies on a suitable combination of views pertinent to the task [5] . in this work, we use metapaths as a semantic interface to specify which relations within a network may be relevant or meaningful and worth investigating. we generate bimodal low-dimensional embeddings for each of these metapaths. furthermore, we conjecture that their relevance on a downstream task varies with the nature of the task and that this task-specific modality relevance should be learned from data. in this work, -we propose a novel method that incorporates restaurants and their attributes into a multimodal graph and extracts multiple, bimodal low dimensional representations for restaurants based on available paths through shared visual, textual, geographical and categorical features. -we use an attention-based fusion mechanism for selectively combining representations extracted from multiple modalities. -we evaluate and contrast the performance of modality-specific representations and joint representations for predicting review volume. 
the principle challenge in working with multimodal data revolves around the task of extracting and assimilating information from multiple modalities to learn informative joint representations. in this section, we discuss prior work that leverages graph-based structures for extracting information from multiple modalities, focussing on the auto-captioning task that introduced such methods. we then examine prior work on network embeddings that aim to learn discriminative representations for nodes in a graph. graph-based learning techniques provide an elegant means for incorporating semantic similarities between multimedia documents. as such, they have been used for inference in large multimodal collections where a single modality may not carry sufficient information [2] . initial work in this domain was structured around the task of captioning unseen images using correlations learned over multiple modalities (tag-propagation or auto-tagging). pan et al. use a graph-based model to discover correlations between image features and text for automatic image-captioning [21] . urban et al. use an image-context graph consisting of captions, image features and images to retrieve relevant images for a textual query [32] . stathopoulos et al. [28] build upon [32] to learn a similarity measure over words based on their co-occurrence on the web and use these similarities to introduce links between similar caption words. rudinac et al. augment the image-context graph with users as an additional modality and deploy it for generating visual-summaries of geographical regions [25] . since we are interested in discovering multimodal similarities between restaurants, we use a graph layout similar to the one proposed by pan et al. [21] for the image auto-captioning task but replace images with restaurants as central nodes. other nodes containing textual features, visual features and users are retained. we also add categorical information like cuisines as a separate modality, allowing them to serve as semantic anchors within the representation. graph representation learning aims to learn mappings that embed graph nodes in a low-dimensional compressed representation. the objective is to learn embeddings where geometric relationships in the compressed embedding space reflect structural relationships in the graph. traditional approaches generate these embeddings by finding the leading eigenvectors from the affinity matrix for representing nodes [16, 24] . with the advent of deep learning, neural networks have become increasingly popular for learning such representations, jointly, from multiple modalities in an end-to-end pipeline [4, 11, 14, 30, 34] . existing random walk-based embedding methods are extensions of the random walks with restarts (rwr) paradigm. traditional rwr-based techniques compute an affinity between two nodes in a graph by ascertaining the steadystate transition probability between them. they have been extensively used for the aforementioned auto-captioning tasks [21, 25, 28, 32] , tourism recommendation [15] and web search as an integral part of the pagerank algorithm [20] . deep learning-based approaches build upon the traditional paradigm by optimizing the co-occurrence statistics of nodes sampled from these walks. deepwalk [22] uses nodes sampled from short truncated random walks as phrases to optimize a skip-gram objective similar to word2vec [17] . 
similarly, node2vec augments this learning paradigm with second-order random walks parameterized by exploration parameters p and q, which control the trade-off between homophily and structural equivalence in the learnt representations [8] . for a homogeneous network, random walk-based methods like deepwalk and node2vec assume that while the probabilities of transitioning from one node to another can be different, every transition still occurs between nodes of the same type. for heterogeneous graphs, this assumption may be fallacious as not all transitions occur between nodes of the same type and, consequently, they do not carry the same semantic context. indeed, our initial experiments with the node2vec model suggest that it is not designed to handle highly multimodal graphs. clements et al. [5] demonstrated that in the context of content recommendation, the importance of modalities is strongly task-dependent and treating all edges in heterogeneous graphs as equivalent can discard this information. metapath2vec [6] remedies this by introducing unbiased walks over the network schema specified by a metapath [29] , allowing the network to learn the semantics specified by the metapath rather than those imposed purely by the topology of the graph. metapath-based approaches have been extended to a variety of other problems. hu et al. use an exhaustive list of semantically-meaningful metapaths for extracting top-n recommendations with a neural co-attention network [10] . shi et al. use metapath-specific representations in a traditional matrix factorization-based collaborative filtering mechanism [27] . in this work, we perform random walks on sub-networks of a restaurant-attribute network containing restaurants and attribute modalities. these attribute modalities may contain images, text or categorical features. for each of these sub-networks, we perform random walks and use a variant of the heterogeneous skip-gram objective introduced in [6] to generate low-dimensional bimodal embeddings. bimodal embeddings have several interesting properties. training relations between two modalities provides us with a degree of modularity, where modalities can be included in or held out from the prediction model without affecting the others. it also makes training inexpensive, as the number of nodes when only considering two modalities is far lower than in the entire graph. in this section, we begin by providing a formal introduction to graph terminology that is frequently referenced in this paper. we then move on to detail our proposed method illustrated in fig. 1 (fig. 1 caption: we use tripadvisor to collect information for restaurants in amsterdam. each venue characteristic is then embedded as a separate node within a multimodal graph. r nodes denote restaurants, i nodes denote images for a restaurant, d nodes are review documents, a nodes are categorical attributes for restaurants and l nodes are locations. bimodal random walks are used to extract pairwise correlations between nodes in separate modalities, which are embedded using a heterogeneous skip-gram objective. finally, an attention-based fusion model is used to combine multiple embeddings together to regress the review volume for restaurants). formally, a heterogeneous graph is denoted by g = (v, e, φ, σ), where v and e denote the node and edge sets, respectively. for every node and edge, there exist mapping functions φ(v) → a and σ(e) → r, where a and r are the sets of node types and edge types, respectively, such that |a| + |r| > 2. for a heterogeneous graph g = (v, e, φ, σ), a network schema is a metagraph m g = (a, r), where a is the set of node types in v and r is the set of edge types in e. a network schema enumerates the possible node types and edge types that can occur within a network. a metapath m(a 1 , a n ) is a path on the network schema m g consisting of a sequence of ordered edge transitions a 1 → a 2 → . . . → a n over the node types in a. 
let g = (v, e) be the heterogeneous graph with a set of nodes v and edges e. we assume the graph to be undirected, as linkages between venues and their attributes are inherently symmetric. below, we describe the node types used to construct the graph (cf. figs. 1 and 2). for the images of a venue, we extract visual features with a pre-trained convolutional neural network and use the penultimate layer output as a compressed low-dimensional representation for each image. since the number of available images for each venue may vary dramatically depending on its popularity, adding a node for every image can lead to an unreasonably large graph. to mitigate this issue, we cluster the image features for each restaurant using the k-means algorithm and use the cluster centers as representative image features for a restaurant, similar to zahálka et al. [35] . we chose k = 5 as a reasonable trade-off between the granularity of our representations and the tractability of generating embeddings for this modality. the way patrons write about a restaurant and the usage of specialized terms can contain important information about a restaurant that may be missing from its categorical attributes. for example, usage of the indian cottage cheese 'paneer' can be found in similar cuisine types like nepali, surinamese, etc., and user reviews talking about dishes containing 'paneer' can be leveraged to infer that indian and nepali cuisines share some degree of similarity. to model such effects, we collect reviews for every restaurant. since individual reviews may not provide a comprehensive unbiased picture of the restaurant, we chose not to treat them individually, but to consider them as a single document. we then use a distributed bag-of-words model from [13] to generate low-dimensional representations of these documents for each restaurant. since the reviews of a restaurant can widely vary based on its popularity, we only consider the 10 most recent reviews for each restaurant to prevent biases from document length getting into the model. as for users: since tripadvisor does not record check-ins, we can only leverage explicit feedback from users who chose to leave a review. we add a node for each of the users who visited at least two restaurants in amsterdam and left a review. similar to [25, 28, 32] , we introduce two kinds of edges in our graph: 1. attribute edges: these are heterogeneous edges that connect a restaurant node to the nodes of its categorical attributes, image features, review features and users. in our graph, we instantiate them as undirected, unweighted edges. 2. similarity edges: these are homogeneous edges between the feature nodes within a single modality. for image features, we use a radial basis function as a non-linear transformation of the euclidean distances between image feature vectors. for document vectors, we use cosine similarity to find restaurants with similar reviews. adding a weighted similarity edge between every node pair in the same modality would yield an extremely dense adjacency matrix. to avoid this, we only add similarity links between a node and its k nearest neighbors in each modality. 
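a sketch of the two kinds of similarity edges is given below; the rbf width gamma and the toy value of k are illustrative choices, not the settings used in the paper.

```python
# hypothetical construction of k-nearest-neighbour similarity edges: rbf over
# euclidean distances for image-feature cluster centres, cosine for review
# document vectors.
import numpy as np

def rbf_knn_edges(features, k=5, gamma=0.1):
    sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    return _top_k_edges(np.exp(-gamma * sq), k)

def cosine_knn_edges(doc_vectors, k=5):
    unit = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return _top_k_edges(unit @ unit.T, k)

def _top_k_edges(sims, k):
    edges = []
    for i in range(sims.shape[0]):
        neighbours = [j for j in np.argsort(-sims[i]) if j != i][:k]
        edges.extend((i, int(j), float(sims[i, j])) for j in neighbours)
    return edges

centres = np.random.rand(6, 8)     # e.g., 6 cluster centres with 8-d features
print(len(rbf_knn_edges(centres, k=2)))
```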
by choosing the nearest k neighbors, we make our similarity threshold adaptive, allowing it to adjust to the varying scales of distance in the different modalities. metapaths can provide a modular and simple interface for injecting semantics into the network. since metapaths, in our case, are essentially paths over the modality set, they can be used to encode inter-modality correlations. in this work, we generate embeddings with two specific properties: 1. all metapaths are binary and only include transitions over 2 modalities. since venues/restaurants are always a part of the metapath, we only include one other modality. 2. during optimization, we only track the short-range context by choosing a small window size. the window size is the maximum distance between the input node and a predicted node in a walk. in our model, walks over a metapath only capture short-range semantic contexts, and the choice of a larger window can be detrimental to generalization. for example, consider a random walk over the restaurant - cuisine - restaurant metapath, whose sampled node sequence alternates between restaurants and cuisines. optimizing over a large context window can lead to mcdonald's (fast-food cuisine) and kediri (indonesian cuisine) being placed close in the embedding space. this is erroneous and does not capture the intended semantics, which should bring restaurants closer only if they share the exact attribute. we use the metapaths in table 1 to perform unbiased random walks on the graph detailed in sect. 3.2. each of these metapaths enforces similarity based on certain semantics. we train separate embeddings using the heterogeneous skip-gram objective similar to [6]. for every metapath m, we maximize the probability of observing the heterogeneous context n a (v) given the node v: arg max θ Σ v∈v m Σ a∈a m Σ c a ∈n a (v) log p(c a | v; θ), (3) where a m is the node type-set and v m is the node-set for metapath m. the original metapath2vec model [6] uses multiple metapaths [29] to learn separate embeddings, some of which perform better than the others (fig. 3 caption: attention-weighted modality fusion: metapath-specific embeddings are fed into a common attention mechanism that generates an attention vector. each modality is then reweighted with the attention vector and concatenated. this joint representation is then fed into a ridge regressor to predict the volume of ratings for each restaurant). on the dblp bibliographic graph that consists of authors (a), papers (p) and venues (v), the performance of their recommended metapath 'a-p-v-p-a' was empirically better than the alternative metapath 'a-p-a' on the node classification task. at this point, it is important to recall that in our model, each metapath extracts a separate view of the same graph. these views may contain complementary information, and it may be disadvantageous to only retain the best-performing view. for an optimal representation, these complementary views should be fused. in this work, we employ an embedding-level attention mechanism similar to the attention mechanism introduced in [33] that selectively combines embeddings based on their performance on a downstream task. 
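the unbiased bimodal walks that feed this objective can be sketched as follows; the adjacency layout (one dict per ordered pair of node types) and the toy graph are illustrative assumptions, not the paper's implementation.

```python
# a minimal metapath-guided walk sampler, e.g. for restaurant-cuisine-restaurant.
import random

def metapath_walk(adj, start, metapath, walk_length, rng=None):
    """adj[(src_type, dst_type)][node] -> neighbours of dst_type."""
    rng = rng or random.Random(0)
    walk, node, step = [start], start, 0
    while len(walk) < walk_length:
        i = step % (len(metapath) - 1)
        neighbours = adj.get((metapath[i], metapath[i + 1]), {}).get(node, [])
        if not neighbours:
            break
        node = rng.choice(neighbours)
        walk.append(node)
        step += 1
    return walk

adj = {("venue", "cuisine"): {"r1": ["indian"], "r2": ["indian"]},
       ("cuisine", "venue"): {"indian": ["r1", "r2"]}}
print(metapath_walk(adj, "r1", ["venue", "cuisine", "venue"], walk_length=7))
```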
assuming s to be the set of metapath-specific embeddings for metapaths m 1 , m 2 , . . . , m n , and following the approach outlined in fig. 3, we can denote it as s = {e m1 , e m2 , . . . , e mn }, where e mi is the embedding learned for metapath m i . we then use a two-layer neural network to learn an embedding-specific attention a mn for each metapath m n . further, we perform a softmax transformation of the attention network outputs to obtain an embedding-specific weight for each metapath. finally, we concatenate the attention-weighted metapath-specific embeddings to generate a fused embedding. we evaluate the performance of the embedding fusion model on the task of predicting the volume (total count) of reviews received by a restaurant. we conjecture that the volume of reviews is an unbiased proxy for the general popularity and footfall of a restaurant and is more reliable than indicators like ranking or ratings, which may be biased by tripadvisor's promotion algorithms. we use the review volume collected from tripadvisor as the target variable and model this task as a regression problem. data collection. we use publicly-available data from tripadvisor for our experiments. to build the graph detailed in sect. 3.2, we collect data for 3,538 restaurants in amsterdam, the netherlands, that are listed on tripadvisor. we additionally collect 168,483 user-contributed restaurant reviews made by 105,480 unique users, of which only 27,318 users visit more than 2 restaurants in the city. we only retain these 27,318 users in our graph and drop the others. we also collect 215,544 user-contributed images for these restaurants. we construct the restaurant network by embedding venues and their attributes listed in table 1 as nodes. bimodal embeddings. we train separate bimodal embeddings by optimizing the heterogeneous skip-gram objective from eq. (3) using stochastic gradient descent, and train embeddings for all metapaths enumerated in table 1. we use restaurant nodes as root nodes for the unbiased random walks and perform 80 walks per root node, each with a walk length of 80. each embedding has a dimensionality of 48, uses a window size of 5 and is trained for 200 epochs. embedding fusion models. we chose two fusion models in our experiments to analyze the efficacy of our embeddings: 1. simple concatenation model: we use a model that performs a simple concatenation of the individual metapath-specific embeddings detailed in sect. 3.4 to exhibit the baseline performance on the tasks detailed in sect. 4. simple concatenation is a well-established additive fusion technique in multimodal deep learning [18, 19]. 2. attention-weighted model: the fusion model described above, which reweights each metapath-specific embedding with its learned attention weight before concatenation. each of the models uses a ridge regression algorithm to estimate the predictive power of each metapath-specific embedding on the volume regression task. this regressor is jointly trained with the attention model in the attention-weighted model. all models are optimized using stochastic gradient descent with the adam optimizer [12] with a learning rate of 0.1. 
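a compact pytorch sketch of the attention-weighted fusion described above is shown below; the hidden size, the single shared two-layer attention network, and the plain linear output layer standing in for the jointly trained ridge regressor (whose l2 penalty would be added to the loss) are assumptions made for illustration.

```python
# hypothetical attention-based fusion of metapath-specific embeddings.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, n_metapaths, emb_dim=48, hidden=32):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(emb_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.regressor = nn.Linear(n_metapaths * emb_dim, 1)

    def forward(self, embeddings):                    # (batch, n_metapaths, emb_dim)
        scores = self.attn(embeddings).squeeze(-1)    # one attention value per metapath
        weights = torch.softmax(scores, dim=-1)       # embedding-specific weights
        fused = (weights.unsqueeze(-1) * embeddings).flatten(1)
        return self.regressor(fused).squeeze(-1)      # predicted review volume

model = AttentionFusion(n_metapaths=6)
batch = torch.randn(4, 6, 48)                         # 4 restaurants, 6 metapaths
print(model(batch).shape)                             # torch.Size([4])
```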
in addition, as mentioned in related work, we performed these experiments with the node2vec model, but since it is not designed for heterogeneous multimodal graphs, it yielded performance scores far below the weakest single modality. a review of the fusion models indicates that taking all the metapaths together can improve performance significantly. the baseline simple concatenation fusion model, commonly used in literature, is considerably better than the best-performing metapath (venues -facilities -venues). the attention basedmodel builds significantly over the baseline performance and while it employs a similar concatenation scheme as the baseline concatenation model, the introduction of the attention module allows it to handle noisy and unreliable modalities. the significant increase in the predictive ability of the attention-based model can be attributed to the fact that while all modalities encode information, some of them may be less informative or reliable than others, and therefore contribute less to the performance of the model. our proposed fusion approach is, therefore, capable of handling weak or noisy modalities appropriately. in this work, we propose an alternative, modular framework for learning from multimodal graphs. we use metapaths as a means to specify semantic relations between nodes and each of our bimodal embeddings captures similarities between restaurant nodes on a single attribute. our attention-based model combines separately learned bimodal embeddings using a late-fusion setup for predicting the review volume of the restaurants. while each of the modalities can predict the volume of reviews to a certain extent, a more comprehensive picture is only built by combining complementary information from multiple modalities. we demonstrate the benefits of our fusion approach on the review volume prediction task and demonstrate that a fusion of complementary views provides the best way to learn from such networks. in future work, we will investigate how the technique generalises to other tasks and domains. 
mantis: system support for multimodal networks of in-situ sensors hyperlearn: a distributed approach for representation learning in datasets with many modalities interaction networks for learning about objects, relations and physics heterogeneous network embedding via deep architectures the task-dependent effect of tags and ratings on social media access metapath2vec: scalable representation learning for heterogeneous networks m-hin: complex embeddings for heterogeneous information networks via metagraphs node2vec: scalable feature learning for networks deep residual learning for image recognition leveraging meta-path based context for top-n recommendation with a neural co-attention model multimodal network embedding via attention based multi-view variational autoencoder adam: a method for stochastic gradient descent distributed representations of sentences and documents deep collaborative embedding for social image understanding how random walks can help tourism image labeling on a network: using social-network metadata for image classification distributed representations of words and phrases and their compositionality multimodal deep learning multi-source deep learning for human pose estimation the pagerank citation ranking: bringing order to the web gcap: graph-based automatic image captioning deepwalk: online learning of social representations the visual display of regulatory information and networks nonlinear dimensionality reduction by locally linear embedding generating visual summaries of geographic areas using community-contributed images imagenet large scale visual recognition challenge heterogeneous information network embedding for recommendation semantic relationships in multi-modal graphs for automatic image annotation pathsim: meta path-based top-k similarity search in heterogeneous information networks line: large-scale information network embedding study on optimal frequency design problem for multimodal network using probit-based user equilibrium assignment adaptive image retrieval using a graph model for semantic feature integration heterogeneous graph attention network network representation learning with rich text information interactive multimodal learning for venue recommendation metagraph2vec: complex semantic path augmented heterogeneous network embedding key: cord-020841-40f2p3t4 authors: hofstätter, sebastian; zlabinger, markus; hanbury, allan title: neural-ir-explorer: a content-focused tool to explore neural re-ranking results date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_58 sha: doc_id: 20841 cord_uid: 40f2p3t4 in this paper we look beyond metrics-based evaluation of information retrieval systems, to explore the reasons behind ranking results. we present the content-focused neural-ir-explorer, which empowers users to browse through retrieval results and inspect the inner workings and fine-grained results of neural re-ranking models. the explorer includes a categorized overview of the available queries, as well as an individual query result view with various options to highlight semantic connections between query-document pairs. the neural-ir-explorer is available at: https://neural-ir-explorer.ec.tuwien.ac.at/. the prevalent evaluation of information retrieval systems, based on metrics that are averaged across a set of queries, distills a large variety of information into a single number. 
this approach makes it possible to compare models and configurations, however it also decouples the explanation from the evaluation. with the adoption of neural re-ranking models, where the scoring process is arguably more complex than traditional retrieval methods, the divide between result score and the reasoning behind it becomes even stronger. because neural models learn based on data, they are more likely to evade our intuition about how their components should behave. having a thorough understanding of neural reranking models is important for anybody who wants to analyze or deploy these models [6, 7] . in this paper we present the neural-ir-explorer: a system to explore the output of neural re-ranking models. the explorer complements metrics based evaluation, by focusing on the content of queries and documents, and how the neural models relate them to each other. we enable users to efficiently browse the output of a batched retrieval run. we start with an overview page showing all evaluated queries. we cluster the queries using their term representations taken from the neural model. users can explore each query result in more detail: we show the internal partial scores and content of the returned documents with different highlighting modes to surface the inner workings of a neural re-ranking model. here, users can also select different query terms to individually highlight their connections to the terms in each document. in our demo we focus on the kernel-pooling models knrm [14] and tk [8] evaluated on the msmarco-passage [2] collection. the kernel-pooling makes it easy to analyze temporary scoring results. finally, we discuss some of the insights we gained about the knrm model using the neural-ir-explorer. the neural-ir-explorer is available at: https://neural-ir-explorer.ec.tuwien.ac.at/. our work sits at the intersection of visual ir evaluation and the interpretability of neural networks with semantic word representations. the ir community mainly focused on tools to visualize result metrics over different configurations: claire allows users to select and evaluate a broad range of different settings [1] ; aviator integrates basic metric visualization directly in the experimentation process [4] ; and the retrieval tool provides a data-management platform for multimedia retrieval including differently scoped metric views [9] . lipani et al. [11] created a tool to inspect different pooling strategies, including an overview of the relevant result positions of retrieval runs. from a visualization point of view term-by-term similarities are similar to attention, as both map a single value to a token. lee et al. [10] created a visualization system for attention in a translation task. transformer-based models provide ample opportunity to visualize different aspects of the many attention layers used [3, 13] . visualizing simpler word embeddings is possible via a neighborhood of terms [5] . now we showcase the capabilities of the neural-ir-explorer (sect. 3.1) and how we already used it to gain novel insights (sect. 3.2). the explorer displays data created by a batched evaluation run of a neural re-ranking model. the back-end is written in python and uses flask as web server; the front-end uses vue.js. the source code is available at: github.com/sebastian-hofstaetter/neural-ir-explorer. when users first visit our website they are greeted with a short introduction to neural re-ranking and the selected neural model. 
we provide short explanations throughout the application, so that that new users can effectively use our tool. we expect this tool's audience to be not only neural re-ranking experts, but anyone who is interested in ir. the central hub of the neural-ir-explorer is the query overview (fig. 1) . we organize the queries by clustering them in visually separated cards. we collapse the cards to only show a couple of queries per default. this is especially useful for collections with a large number of queries, such as the msmarco collection we use in this demo (the dev set contains over 6.000 queries). in the cluster header we display a manually assigned summary title, the median result of the queries, and median difference to the initial bm25 ranking, as this is the basis for the re-ranking. each query is displayed with the rank of the first relevant document, the difference to bm25, and the query text. the controls at the top allow to sort the queries and clusters -including a random option to discover new queries. users can expand all clusters or apply a term-prefix filter to search for specific words in the queries. once a user clicks on a query, they are redirected to the query result view (fig. 2) . here, we offer an information rich view of the top documents returned by the neural re-ranking model. each document is displayed in full with its rank, overall and kernel-specific scores. the header controls allow to highlight the connections between the query and document terms in two different ways. first, users can choose a minimum cosine similarity that a term pair must exceed to be colored, which is a simple way of exploring the semantic similarity of the word representations. secondly, for kernel-pooling models that we support, we offer a highlight mode much closer to how the neural model sees the document: based on the association of a term to a kernel. users can select one or more kernels and terms are highlighted based on their value after the kernel transformation. additionally, we enable users to select two documents and compare them side-by-side (fig. 3) . users can highlight query-document connections as in the list view. additionally, we display the different kernel-scores in the middle, so that users can effectively investigate which kernels of the neural model have the deciding influence of the different scores for the two documents. we already found the neural-ir-explorer to be a useful tool to analyze the knrm neural model and understand its behaviors better. the knrm model includes a kernel for exact matches (cosine similarity of exactly 1), however judging from the displayed kernel scores this kernel is not a deciding factor. most of the time the kernels for 0.9 & 0.7 (meaning quite close cosine similarities) are in fact the deciding factor for the overall score of the model. we assume this is due to the fact, that every candidate document (retrieved via exact matched bm25) contains exact matches and therefore it is not a differentiating factor anymore -a specific property of the re-ranking task. additionally, the neural-ir-explorer also illuminates the pool bias [12] of the msmarco ranking collection: the small number of judged documents per query makes the evaluation fragile. users can see how relevant unjudged documents are actually ranked higher than the relevant judged documents, wrongly decreasing the model's score. we presented the content-focused neural-ir-explorer to complement metric based evaluation of retrieval models. 
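the kernel scores and the kernel-based highlighting discussed above come from a gaussian kernel transformation of term-by-term cosine similarities; the sketch below follows the standard knrm formulation, with kernel centres (including the exact-match kernel at mu = 1.0) and width chosen as common defaults that may differ from the demo's configuration.

```python
# hypothetical kernel activations for a single query-document term pair.
import numpy as np

DEFAULT_MUS = np.array([1.0, 0.9, 0.7, 0.5, 0.3, 0.1, -0.1, -0.3, -0.5, -0.7, -0.9])

def kernel_activations(cosine_sim, mus=DEFAULT_MUS, sigma=0.1):
    return np.exp(-((cosine_sim - mus) ** 2) / (2 * sigma ** 2))

# a term pair with cosine similarity 0.72 mostly activates the 0.7 kernel,
# which matches the observation that the 0.9 and 0.7 kernels often dominate.
print(kernel_activations(0.72).round(3))
```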
the key contribution of the neural-ir-explorer is to empower users to efficiently explore retrieval results in different depths. the explorer is a first step to open the black-boxes of neural re-ranking models, as it investigates neural network internals in the retrieval task setting. the seamless and instantly updated visualizations of the neural-ir-explorer offer a great foundation for future work inspirations, both for neural ranking models as well as how we evaluate them. claire: a combinatorial visual analytics system for information retrieval evaluation ms marco: a human generated machine reading comprehension dataset visualizing and measuring the geometry of bert a progressive visual analytics tool for incremental experimental evaluation interactive analysis of word vector embeddings let's measure run time! extending the ir replicability infrastructure to include performance aspects on the effect of lowfrequency terms on neural-ir models interpretable & time-budgetconstrained contextualization for re-ranking retrieval: an online performance evaluation tool for information retrieval methods interactive visualization and manipulation of attention-based neural machine translation visual pool: a tool to visualize and interact with the pooling method the impact of fixedcost pooling strategies on test collection bias a multiscale visualization of attention in the transformer model end-to-end neural ad-hoc ranking with kernel pooling acknowledgements. this work has received funding from the european union's horizon 2020 research and innovation program under grant agreement no. 822670. key: cord-020908-oe77eupc authors: chen, zhiyu; jia, haiyan; heflin, jeff; davison, brian d. title: leveraging schema labels to enhance dataset search date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_18 sha: doc_id: 20908 cord_uid: oe77eupc a search engine’s ability to retrieve desirable datasets is important for data sharing and reuse. existing dataset search engines typically rely on matching queries to dataset descriptions. however, a user may not have enough prior knowledge to write a query using terms that match with description text. we propose a novel schema label generation model which generates possible schema labels based on dataset table content. we incorporate the generated schema labels into a mixed ranking model which not only considers the relevance between the query and dataset metadata but also the similarity between the query and generated schema labels. to evaluate our method on real-world datasets, we create a new benchmark specifically for the dataset retrieval task. experiments show that our approach can effectively improve the precision and ndcg scores of the dataset retrieval task compared with baseline methods. we also test on a collection of wikipedia tables to show that the features generated from schema labels can improve the unsupervised and supervised web table retrieval task as well. dataset retrieval is receiving more attention as people from different fields and domains start to rely on datasets for their work. there are many data portals with the purpose of effective and efficient data management and data sharing, such as data.gov 1 , datahub 2 and data.world 3 . most of those data portals use ckan 4 as their backend. 
however, there are two problems of dataset search engines using such infrastructure: first, ranking performance relies on the quality of metadata of datasets, while many datasets lack high quality metadata; second, the information in the metadata may not satisfy the user's information need or help them solve their task [3] . a user may not know the organization of a potentially relevant dataset, or the tags data publishers provide with a dataset. such information can hardly be used for dataset ranking. in this paper, we focus on the problem of dataset retrieval where dataset content is in tabular form, since tabular data is widely-used and easy to read and write. as illustrated in fig. 1 , a dataset consists of a data table (dataset content) and metadata. a data table usually has one header row, followed by one or more data rows. the header row consists of a list of schema labels (attribute names) whose actual values are stored in data rows. metadata usually includes title and description of the dataset. schema labels, which represent high-level concepts, are underutilized if we directly score them with a user query. consider the example in fig. 1 ; the vocabulary of schema labels could be very different from other fields and user queries. "locationabbr", standing for "location abbreviation", is unlikely to appear in a user query so this dataset is less likely to be recalled. however, we can enhance this dataset by generating schema labels such as "place" and "city" appearing in other, similar datasets, which could provide a better soft-matching signal with respect to a user query, and therefore increase the chance that it can be recalled. in this work, we first propose a new method for schema label generation. we learn latent feature representations of schema labels automatically by jointly decomposing the dataset-schema label interaction matrix and schema labelschema label interaction matrix. then we propose a framework for enhancing dataset retrieval by schema label generation to address the problem that schema labels are not effectively used by existing dataset search engines. we create a new public benchmark 5 based on federal (u.s.) datasets and use it to demonstrate the effectiveness of our proposed framework for dataset retrieval. we additionally consider a web table retrieval task and demonstrate that the features generated from schema labels can be effective for supervised ranking. dataset search has become a new research field with new challenges. chapman et al. [3] classify dataset search into basic and constructive dataset search. basic dataset search returns a list of existing datasets based on a user's query, while constructive dataset search [5] generates datasets on-the-fly based on a user's needs and query. google recently released a dataset search service 6 . like many other data portals, their service relies on metadata of datasets, annotated on web pages using a standard defined by schema.org. other work on applications of web tables is also related to our work. cafarella et al. [2] proposed webtables system which extract web tables from top ranked pages by keyword search. sekhavat et al. [13] proposed a probabilistic method that augments an existing knowledge base with facts from web tables. zhang et al. [16] developed generative probabilistic models to equip spreadsheets with smart assistance capabilities. specifically, given a table, they recommend additional rows and column headings by leveraging the information from the web tables. 
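to make the dataset structure discussed around fig. 1 concrete, the toy record below sketches the fields used throughout the rest of the paper; all field values are hypothetical and only the "locationabbr" header comes from the text.

```python
# a toy dataset record in the spirit of fig. 1; values are hypothetical.
dataset = {
    "metadata": {
        "title": "chronic disease indicators",
        "description": "indicators collected from state health departments.",
    },
    "schema_labels": ["yearstart", "locationabbr", "topic", "datavalue"],
    "data_rows": [
        [2015, "CA", "asthma", 8.1],
        [2015, "NY", "asthma", 9.4],
    ],
    # labels produced by the generator of sect. 3.1, which can soft-match
    # queries that never use the raw header vocabulary (e.g. "place", "city").
    "generated_schema_labels": ["year", "place", "city", "category", "value"],
}
```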
they also developed semantic matching features for table retrieval [17] . the techniques designed for web table analysis could potentially be applied to dataset search. in our work, each dataset is associated with data in tabular form. extracting useful information from tables such as entities and attribute names could help with the retrieval task. trabelsi et al. [14] recently proposed custom embeddings for column headers based on multiple contexts for table retrieval, and found representing numerical cell values to be useful. zhang et al. [16] proposed to use semantic concepts to represent queries and tables for ranking entity-focused tables. however, dataset search could be inherently more difficult since datasets do not need to be entity-focused. in this section, we introduce the framework of schema label enhanced dataset retrieval. as illustrated in fig. 2 , our framework has two stages: in the first stage, we first train a schema label generator with the method proposed in sect. 3.1 and use it to generate additional schema labels for all the datasets; in the second stage, we use a mixed ranking model to combine the scores of schema labels and other fields for dataset ranking. in the following subsections, we present a detailed illustration of the two stages. we propose to improve dataset search by making use of generated schema labels, since these can be complementary to the original schema labels and especially valuable when they are otherwise absent from a dataset. we treat schema label generation as a multi-label classification problem. let l = {l 1 , l 2 , ..., l k } denote the labels appearing in all datasets and d = {(x i , y i )|1 ≤ i ≤ n} denote the training set. here, for each training sample (x i , y i ), x i is a d-dimensional feature vector of column i which can be calculated from data rows [4] or learned from matrix factorization proposed later in this section. y i is k-dimensional vector [y i 1 , y i 2 , ..., y i k ] and y i j = 1 only if x i is relevant to label l j , otherwise y i j = 0. our objective is to learn a function that models p (l|x i ), (l ∈ l). to generate m schema labels for column i, we can select the top m labels l m by: we could also generate schema labels by selecting a probability threshold θ: in practice, we could first generate the top m schema labels and filter out those results with a probability lower than the threshold. chen et al. [4] proposed to predict schema labels based on curated features of data values. instead of designing curated features for schema labels, we consider learning their representations in an automated manner. inspired by collaborative filtering methods in recommender systems, we model each dataset as a user and each schema label as an item. then a dataset with a schema label can be considered as positive feedback between a user and an item. by exploiting the user-item co-occurrences and item-item co-occurrences, we can learn the latent representations of schema labels. in the following, we show how to construct a preference matrix in the context of schema label generation and how to learn the schema label features. with m data tables and n unique schema labels, we can construct a dataset-column preference matrix m m×n , where m up is 1 if dataset u contains schema label p. matrix factorization. mf [7] decomposes m into the product of u m×k and p k×n where k < min(m, n). u t can be denoted as (α 1 , ..., α u ..., α m ) where α u ∈ r k represents the latent factor vector of dataset u. 
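the displayed selection rules (top-m labels and probability threshold) did not survive extraction; the minimal sketch below shows the two rules as described, assuming a vector of per-label probabilities p(l|x_i) produced by any multi-label classifier (the scikit-learn detail in the comment is an assumption about one possible way to obtain it).

```python
import numpy as np

def generate_labels(probs, label_names, m=10, theta=0.5):
    """select the top-m labels by p(l | x_i), then drop those below theta,
    mirroring the two selection rules described in sect. 3.1."""
    top_m = np.argsort(probs)[::-1][:m]
    return [label_names[j] for j in top_m if probs[j] >= theta]

# `probs` can come from any multi-label classifier; with scikit-learn's
# RandomForestClassifier trained on an indicator matrix y, the per-label
# positive-class probabilities can be assembled (assuming every label has
# both classes in training) as:
#   probs_all = np.column_stack([p[:, 1] for p in clf.predict_proba(X)])
# so generate_labels(probs_all[i], label_names) yields labels for column i.
```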
similarly, p t can be denoted as (β 1 , ..., β p ..., β n ) where β p ∈ r k represents the latent factor vector of schema label p. since the preference matrix actually models the implicit feedback, mf optimizes the following objective function: where c up is a hyperparameter tuned to balance the non-zero and zero values since m is a sparse matrix. λ α and λ β are regularization parameters that adjust the importance of regularization terms u α u 2 and p β p 2 . label embedding. recently, word embedding techniques (e.g., word2vec [11] ) have been valuable in natural language processing tasks. given a sequence of words, a low-dimensional continuous representation called word embedding can be learned for each word. word2vec's skip-gram model with negative sampling (sgns) is equivalent to implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (pmi) of the respective word and context pairs, shifted by a global constant [9] . the pmi between word i and its context word j is defined as: where #(i, j) is the number of times word j appears in the context window of word i and |d| is the total number of word-context pairs. then, a shifted positive pmi (sppmi) of word i and word j is calculated as: where k is the number of negative samples of sgns. given a corpus, matrix m sp p mi can be constructed based on eq. (2) and factorizing it is equivalent to performing sgns. a schema label exists in the context of other schema labels. therefore, we perform word embedding techniques to learn the latent representations of schema labels. however, we do not consider the order of schema labels. therefore, given a schema label, all other schema labels which come from the same data table are considered as its context. with the constructed sspmi matrix of co-occurring schema labels, we are able to decompose it to learn the latent representations of schema labels. joint learning of schema label representations. schema label representations learned from mf capture the interactive information between datasets and schema labels, while the word2vec style representations explain the cooccurrence relationships of schema labels. we use the cofactor model [10] to jointly learn schema label representations from both dataset-label interaction and label-label interaction: from the objective function we can see the schema label representation β p is shared between mf and schema label embedding. γ i is the latent representation of context embedding. b p and c i are the schema label embedding bias and context embedding bias, respectively. the last line of eq. 3 incorporates regularization terms with different λ controlling their effects. we use the vector-wise als algorithm [15] to optimize the parameters. after obtaining the jointly learned representations of schema labels, we can use them as features for schema label generation. in this paper, we use the concatenation of schema label representations introduced here and the curated features proposed by chen et al. [4] to construct each x i . any multi-label classification models can be used to train the schema label generator and in this paper we choose random forest. based on the schema label generation method proposed above, we index the generated schema labels for each dataset. now, each dataset has the following fields: metadata, data rows, schema labels and generated schema labels. a straightforward way to rank datasets is to use traditional ranking methods for documents. 
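the displayed equations for the mf objective, the (shifted positive) pmi and the joint cofactor objective are missing from the extracted text; the latex block below gives plausible reconstructions in the notation of this section, based on the cited implicit-feedback mf [7], sgns-as-matrix-factorization [9] and cofactor [10] formulations. they are reconstructions, not the verbatim equations.

```latex
% weighted matrix factorization of the implicit preference matrix M
\min_{\alpha,\beta}\;\sum_{u,p} c_{up}\,\bigl(m_{up}-\alpha_u^{\top}\beta_p\bigr)^2
 + \lambda_\alpha \sum_u \lVert\alpha_u\rVert^2
 + \lambda_\beta \sum_p \lVert\beta_p\rVert^2

% pmi of schema labels i and j, and its shifted positive version (eq. 2-style),
% with k the number of negative samples of sgns
\mathrm{PMI}(i,j)=\log\frac{\#(i,j)\cdot|D|}{\#(i)\cdot\#(j)},\qquad
\mathrm{SPPMI}(i,j)=\max\bigl(\mathrm{PMI}(i,j)-\log k,\;0\bigr)

% cofactor-style joint objective (eq. 3-style): shared \beta_p, context vectors
% \gamma_i, biases b_p and c_i, and a last line of regularization terms
\min\;\sum_{u,p} c_{up}\bigl(m_{up}-\alpha_u^{\top}\beta_p\bigr)^2
 +\sum_{\substack{p,i\\ m^{\mathrm{SPPMI}}_{pi}\neq 0}}
   \bigl(m^{\mathrm{SPPMI}}_{pi}-\beta_p^{\top}\gamma_i-b_p-c_i\bigr)^2
 +\lambda_\alpha\sum_u\lVert\alpha_u\rVert^2
 +\lambda_\beta\sum_p\lVert\beta_p\rVert^2
 +\lambda_\gamma\sum_i\lVert\gamma_i\rVert^2
```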
zhang and balog [17] represent tables as single field documents or multifield documents for table retrieval task. for single field document representation, a dataset is treated as a single document by concatenating the text from all the fields. then traditional methods such as bm25 can be used to score the dataset. for multifield document representation, each field is scored independently against the query and a weighted sum is used for ranking. in our schema label mixed ranking (slmr) model, we score schema labels differently from other fields. the focus of our work is to learn how schema labels, data rows and other metadata may differently influence dataset retrieval performance. note that, for simplicity, we consider the other metadata (title and description) as a single text field, since title and description are homogeneous compared with schema labels and data rows. therefore, we have the following scoring function for a dataset d: where f text denotes the concatenation of title and description, f data denotes the data table, and f l denotes the generated schema labels. each field has a corresponding weights. f text and f data have the same scoring function score text while f l has a different scoring function score l . for f text and f data , we can use a standard scoring function for normal documents. in the experiments, we use bm25 as score text . due to the existence of a large number of non-dictionary words in schema labels [4] that would otherwise be outside of the vocabulary of a word-based embedding, we represent schema labels and query terms using fasttext [1] in score l , since such word embeddings are calculated from character n-grams instead of terms. to score the schema labels with respect to a query, we use the negative word mover's distance (wmd) [8] . wmd measures the dissimilarity between two text documents as the minimum amount of distance that the word embeddings of one document need to "travel" to reach the word embeddings of another document. so score l (q, f l ) = −wmd(f asttext(q), f asttext(f l )) reflects the semantic similarity between a query and schema labels. here we describe how we construct the new benchmark for dataset retrieval in detail. we collected 2417 resources published by the u.s. federal government from data.gov which cover a variety of topics. each resource includes one or more csv format data tables and corresponding metadata. each csv table is treated as a single dataset and we use the resource-level metadata to annotate each dataset. we created six tasks in which each describes a separate information need to find one or more datasets. for each, we have a statement about the information need which describes what datasets are considered as relevant. we additionally verified for each task the existence of at least one relevant dataset. the dataset is public available 7 . we used amazon mechanical turk 8 to obtain diverse queries for these tasks from real users. every annotator was presented with the task descriptions and asked to provide a query for each created task. to avoid the impact of task order on the quality of annotations, we randomly shuffled the order of tasks for each annotator. we paid one dollar for each completed annotation job and 20 queries were collected for each task. every collected query was manually examined and obviously unrelated queries were excluded from the collection. 
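since the scoring function referenced as eq. (4) is missing from the extracted text, the sketch below shows the weighted combination it describes; bm25_score, fasttext_embed and wmd are assumed helper functions standing in for an actual retrieval library, not a specific api.

```python
def slmr_score(query, dataset, w_text, w_data, w_l,
               bm25_score, fasttext_embed, wmd):
    """schema label mixed ranking: a weighted sum of a text-field score, a
    data-table score and a schema-label score, as described for eq. (4).
    bm25_score(query, field_text), fasttext_embed(tokens) and wmd(a, b) are
    assumed helpers supplied by the caller."""
    f_text = dataset["title"] + " " + dataset["description"]
    f_data = dataset["data_table_text"]
    f_labels = " ".join(dataset["generated_schema_labels"])

    score_text = bm25_score(query, f_text)
    score_data = bm25_score(query, f_data)
    # semantic similarity of query and labels: negative word mover's distance
    # between fasttext representations (robust to out-of-vocabulary headers)
    score_labels = -wmd(fasttext_embed(query.split()),
                        fasttext_embed(f_labels.split()))

    return w_text * score_text + w_data * score_data + w_l * score_labels
```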
for each task and each suggested query, we used traditional ranking functions to score single field representations of each dataset and collect the top 100 results. the following ranking models were used: bm25, tf-idf, language model based on jelinek-mercer smoothing, and language model with dirichlet smoothing. we also used each model with two different representations: the concatenation of all fields of the dataset and the concatenation of title and description. this leads to eight baselines for the pooled results. then, the collected task-dataset pairs were annotated for relevance using the crowdsourcing service provided by figure eight 9 . we did not annotate the querydataset pairs because the goal of dataset retrieval is to find relevant datasets with respect to a task which represents the real information need. annotators were presented with the task title, description and link to the data table. each task-dataset pair was judged on a four point scale: 0 (off topic), 1 (poor), 2 (good), and 3 (excellent). 10 every annotator was paid 10 cents per task-dataset judgement. every single task-dataset pair was judged by three annotators and we take the majority vote as the relevance label. if no majority agreement is achieved, we take the average of the scores as the final label. the statistics of annotation results is shown in table 1 . we evaluate dataset retrieval performance over a range of metrics: precision at k and normalized discounted cumulative gain (ndcg) at k [6] . to test the significance of differences between model performances, we use paired t-tests with significance at the p = 0.01 level. we first present the baseline retrieval methods. a dataset is considered as a single document. we use bm25 to score the concatenation of title and description, the text of the data table and the concatenation of all of them. by comparing the three results, we can learn about field level importance for dataset retrieval. parameters are chosen by grid search. multifield document ranking (mdr). by setting w l = 0, eq. (4) degenerates to the mixture of language models [12] . bm25 is also used here as score text () in order to have a fair comparison with other methods. to optimize field weights, we use coordinate ascent. finally, smoothing parameters are optimized in the same manner as single-field document ranking. in this section, we examine the following research questions: q1 does data table content help in dataset retrieval? q2 do generated schema labels help in dataset retrieval? q3 which fields are most important for the dataset retrieval task? we first obtain features of schema labels as described in sect. 3.1 and the number of latent factors is set to 40. then we train a random forest with the learned schema label features. the scikit-learn implementation of random forest 11 is used with default parameters except the number of trees is set to 25. in practice, we could choose any multi-label classifier. for each column, we select the top 10 generated schema labels and filter those with probability lower than 0.5. for each dataset, we index the generated schema labels as an additional field. table 2 summarizes the ndcg at k and precision at k of different models. note that, for schema label mixed ranking (slmr), we trained three different models and the weights of used fields were forced to be non-zero in order to study the proposed research questions. the weights of used fields for multifield document representation are also set non-zero when optimizing the parameters. 
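a minimal sketch of the two evaluation metrics on the four-point judgment scale follows; the gain 2^rel - 1 is one common ndcg formulation, and binarizing graded labels at 2 (good) for precision is an assumption, since the text does not state how this is done.

```python
import numpy as np

def precision_at_k(relevances, k, threshold=2):
    """fraction of the top-k results judged relevant; a graded label of
    2 (good) or 3 (excellent) is assumed to count as relevant."""
    rels = np.asarray(relevances[:k])
    return float(np.mean(rels >= threshold))

def dcg_at_k(relevances, k):
    rels = np.asarray(relevances[:k], dtype=float)
    discounts = np.log2(np.arange(2, rels.size + 2))
    return float(np.sum((2 ** rels - 1) / discounts))

def ndcg_at_k(relevances, k):
    """ndcg with graded relevance; the ideal ranking sorts judgments descending."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# example: judgments of the top-5 results returned for one query
print(ndcg_at_k([3, 0, 2, 1, 0], k=5), precision_at_k([3, 0, 2, 1, 0], k=5))
```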
from the results of single-field document ranking, we can see that only utilizing the data table for ranking leads to the worst performance. scoring on the concatenation of title and description achieved the best results, which indicates that title and description are more important than the data table for ranking a dataset (q3). treating all fields of a dataset as a single-field document provides performance between the previous two models. this result is expected since data tables are usually much longer than titles and descriptions, and therefore dominate the combined representation. by comparing the results of single-field and multifield document ranking, we observe that combining the scores of the data table, title and description can improve ndcg@k. though ndcg@k decreases when k increases, the relative improvement over single-field document ranking is more significant. in contrast, for precision@5 and precision@10, single-field document ranking performs better than multifield document ranking, though the differences are small. so for q1, under the setting of multifield document ranking, the content of the data table can help ndcg but does not help the precision of dataset retrieval results. without scoring data tables, our proposed schema label mixed ranking approach achieves the highest ndcg at all rank cut-offs, which indicates that the generated schema labels can be useful to improve the ndcg of dataset retrieval results (q2). though precision@20 of multifield document ranking is higher than that of our proposed model, the difference is no more than 0.4% (p value > 0.9). notably, our model outperforms the best baseline methods by 21.3% for precision@5 (p value < 0.01). whether data tables are scored or not, precision@k is not significantly different for schema label mixed ranking. therefore, under the setting of schema label mixed ranking, data tables make little contribution in this scenario (q1). one possible reason could be that data tables collected from data.gov contain large quantities of numerical values and thus rarely match user queries. if a schema label mixed ranking model scores only on titles and descriptions (w_l = 0), it is equivalent to a single-field ranking model scoring on titles and descriptions. therefore, we can compare the results in the first and fifth rows of table 2. with generated schema labels, the ranking model achieves higher performance on the dataset retrieval task (q2). the task of dataset search is similar to web table retrieval, so we additionally evaluate on the collection of wikipedia tables used by zhang and balog [16], generating schema labels for the table columns using the method proposed in sect. 3.1. then we append five additional features based on schema labels to their proposed features. each feature is one type of semantic similarity between the query and schema labels. four features are calculated using the measurements proposed by zhang and balog (one early fusion feature, three late fusion features) and the last feature is the negative of the word mover's distance. finally, like zhang and balog, we use random forest to perform pointwise regression and the final reported results are averaged over five runs of 5-fold cross-validation and shown in table 3. we can see that schema label features alone cannot outperform str, but combining them results in an improvement. moreover, by calculating the normalized feature importance measured in terms of gini score, we find that for str with schema label features, the wmd-based measurement contributes the most among all the semantic features. 
thus it demonstrates that the schema labels can be valuable for the table retrieval task as well. notably, in this table corpus, many tables lack much table content but contain rich text descriptions, which could be unfair for schema label generation-based methods. while for dataset search, each table has values but may lack high quality dataset descriptions. we believe that our schema label generation method can outperform str in the scenario where text descriptions provide less useful information than the table itself. we also show unsupervised ranking results with eq. 4 in table 4 . unlike zhang and balog [16] , we consider page title, section title and caption as a single text field, in order to reduce the number of hyperparameters (field weights). the results show that generated labels are more effective than original labels for table ranking. it is unsurprising because generated labels often include not only original labels but also additional labels that can benefit the ranking model. we also notice that including the data table field achieves better results than not scoring it, which is contrary to the results of dataset ranking. it is also expected since wikitables are entity-focused and include a lot of text information while data tables from data.gov include more numeric values. in this paper, we have proposed a schema label enhanced ranking framework for dataset retrieval. the framework has two stages: in the first stage, a schema label generator is trained to generate additional schema labels for each dataset column; in the second stage, given a user query, datasets are ranked by their original fields together with generated schema labels. schema label generation is treated as a multi-label classification task in which each column of a dataset is associated with multiple schema labels. instead of using hand-curated features, we learn the latent feature representations of schema labels by a cofactor model in which the dataset-schema label interactions and schema label-schema label interactions are captured. with the schema label mixed ranking model, the traditional ranking scores for text fields (title, description, data rows) and word embedding-based scores for generated schema labels can be used to rank the datasets. we created a new benchmark to evaluate the performance of dataset retrieval. the experimental results demonstrate our proposed framework can effectively improve the performance on the dataset retrieval task. it achieved the highest ndcg on all the rank cut-offs compared with all baseline methods. we also apply our method to the web table retrieval task which is similar to dataset search and find that the features generated from schema labels can help in supervised ranking as well. 
enriching word vectors with subword information webtables: exploring the power of tables on the web dataset search: a survey generating schema labels through dataset content analysis extending rapidminer with data search and integration capabilities cumulated gain-based evaluation of ir techniques matrix factorization techniques for recommender systems from word embeddings to document distances neural word embedding as implicit matrix factorization factorization meets the item embedding: regularizing matrix factorization with item co-occurrence distributed representations of words and phrases and their compositionality combining document representations for known-item search knowledge base augmentation using tabular data improved table retrieval using multiple context embeddings for attributes parallel matrix factorization for recommender systems entitables: smart assistance for entity-focused tables ad hoc table retrieval using semantic similarity acknowledgment. this material is based upon work supported by the national science foundation under grant no. iis-1816325. key: cord-020820-cbikq0v0 authors: papadakos, panagiotis; kalipolitis, orfeas title: dualism in topical relevance date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_40 sha: doc_id: 20820 cord_uid: cbikq0v0 there are several concepts whose interpretation and meaning is defined through their binary opposition with other opposite concepts. to this end, in this paper we elaborate on the idea of leveraging the available antonyms of the original query terms for eventually producing an answer which provides a better overview of the related conceptual and information space. specifically, we sketch a method in which antonyms are used for producing dual queries, which can in turn be exploited for defining a multi-dimensional topical relevance based on the antonyms. we motivate this direction by providing examples and by conducting a preliminary evaluation that shows its importance to specific users. dualism denotes the state of two parts. the term was originally coined to denote co-eternal binary opposition and has been especially studied in philosophy. for example, there is duality in ethics (good -bad), in human beings (man -nietzsche'sübermensch or man -god) and in logic (true -false). in addition, dualism determines in a great extent our everyday lives (ugly -beautiful, happyunhappy, etc.), and our relations with other people (rich -poor, black -white, love -hate, etc.). none of these concepts can be understood without their dual concepts, since this duality and opposition generates their meaning and interpretation. dualism is also crucial in mathematics and physics (e.g., matterantimatter), and is the power behind our whole information society and our binary data. moving from philosophy, sciences and everyday life to information retrieval, we find a very vague situation. users of search engines are 'dictated' to provide a very concise and specific query that is extremely efficient for focalized search (e.g., looking for a specific hotel). on the other hand, studies show that 60% of user tasks are of exploratory nature [12] . in such tasks users do not accurately know their information need and can not be satisfied by a single 'hit' [5] . consequently, users spend a lot of time reformulating queries and investigating results, in order to construct a conceptual model regarding their information need. information needs that include non-monosemous terms can be considered such exploratory tasks. 
however, the simplicity of inserting terms in an empty text box and 'magically' return the most relevant object(s), will always be a desired feature. in this paper we elaborate on the idea of leveraging the available antonyms of the original query terms (if they exist), for eventually producing an answer which provides a better overview of the related information and conceptual space. we sketch a method in which antonyms are used for producing dual queries, which in turn can be exploited for defining a multi-dimensional topical relevance. this approach can be applied on demand, helping users to be aware of the various opposing dimensions and aspects of their topic of interest. a preliminary evaluation shows the value of the approach for some exploratory tasks and users. to the best of our knowledge, the proposed direction is not covered by the existing literature. antonyms have been studied in fuzzy logic [7] showing a relation with negates. in the ir domain, query expansion methods are based on synonyms and semantically related terms, but do not exploit antonyms explicitly, while in relevance and pseudo-relevance feedback techniques the antonyms are essentially penalized [1] . results diversification can produce a kind of dual clusters, but this is neither guaranteed nor controlled [3] . "capitalism and war". consider a user exploring the relationship between capitalism and war. the user submits to a wse (web search engine) the query "capitalism and war" and starts inspecting the results. the left part of fig. 1 shows the top-5 results for this query from a popular wse. the results include articles about the connection of capitalism with war from research and academic domains, as well as from socialistic, communistic and theological sites. considering a different direction, the user might also be interested about how capitalism can support peace, the dual of war. the top-5 results for the query "capitalism and peace" are shown at the right side of fig. 1 . they contain a wikipedia and a research article about the capitalist peace theory, and articles about the importance of capitalism for the prosperity of modern societies and its association to peace from policy research organizations. analogously, since socialism is the economic system that opposes capitalism, the user could be interested about how socialism may promote war or support peace, by inspecting the results of the queries "socialism and war" and "socialism and peace" respectively. the top-5 results for each of the above queries are shown in fig. 2 . the results for the former query include the socialism and war pamphlet written by lenin, a collection of articles by the economist and philosopher friedrich hayek, a list of articles from two marxist domains, and a critical article for both left and right views from the foundation for economic education. for the latter query, the results include articles connecting socialism with peace, like a chapter from the encyclopedia of anti-revisionism, a wikipedia article about the theoretical magazine problems of peace and socialism, and an article from a site supporting a far left u.s. party. the above hits indicate interesting directions to the original information need of the user. we argue that users should get aware of these directions for a better exploration of the domain at hand, since they can provide a more comprehensive view of the information and conceptual space. 
furthermore, the exploration of these directions let available supportive or counter arguments of dual concepts to emerge, leading to better informed and responsible humans and citizens. "aloe". a comprehensive view of the various different directions can be beneficial also for reducing false-positive results. for example, consider a pregnant woman that was advised to take aloe vera by mouth to relieve digestive discomfort. to check if this is true, she submits to a wse the query "aloe vera indications". however, since aloe can stimulate uterine contractions, increasing the risk of miscarriage or premature birth, it is crucial to know also its contraindications. the proposed direction can alleviate this problem, because this information would be contained in the results of the query "aloe vera contraindications". one can imagine various ways for leveraging antonyms. we shall hereafter use t t to denote that the terms t and t are antonyms. building on the "capitalistic" example of the previous section, according to the online dictionary wordnet 1 , socialism capitalism, and war peace. now, we can generate all possible queries, denoted by q, where non-monosemous terms of the original query are substituted by their dual ones, as expressed by their antonyms. for example, the query "capitalism and war" will generate three extra queries: "socialism and peace", "capitalism and peace" and "socialism and war". based on q we can now define two vector spaces. in the first case, the space has |q| dimensions, where each query is a dimension of the space. each document is placed in this space according to its relevenace to each query. in the second case we assume a space with only |q| 2 dimensions. each dimension represents a pair of dual queries, where each query in the pair contains the antonyms of the other. we denote with q q , that the queries q and q are dual. for our running example, the first pair is ("capitalism and war","socialism and peace") and the second one is ("capitalism and peace","socialism and war"). each pair defines an axis, therefore the two pairs define a 2d space against which we can evaluate the "value" of each document. for each axis we can consider policies for composing the relevance scores of each document to each member of a dual query. generally, there are various criteria that can be considered for assessing the value of each document or set of documents. such criteria include the bias of documents to specific queries (e.g., the original user query), the purity to a specific query, the overview factor of a document regarding either a dual query or all queries, and the diversity of the returned set of documents with respect to these queries. in general, we need to define appropriate ranking methods, that will take into account the relevance of the documents to the available queries for different criteria. therefore, we will explore whether the existing multiplecriteria approaches described in [4, 6, 9, 13] are appropriate for the problem at hand. regarding the process of finding the corresponding antonyms, we can use existing dictionaries like wordnet for nouns and adjectives or word-embedding antonym detection approaches like [8, 11] . the case of verbs and adverbs is more complicated since they require a kind of grammatical and language analysis (i.e., exist not exist, lot total, a lot bit, etc). there are three categories of antonyms: (a) gradable, (b) relational and (c) complementary. 
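a minimal sketch of the dual-query generation step using nltk's wordnet interface (assuming nltk and the wordnet corpus are installed) follows; taking only the first listed antonym is a simplification that glosses over the antonym-selection questions raised above.

```python
from itertools import product
from nltk.corpus import wordnet as wn

def antonyms(term):
    """all wordnet antonym lemmas for a term."""
    ants = set()
    for synset in wn.synsets(term):
        for lemma in synset.lemmas():
            for ant in lemma.antonyms():
                ants.add(ant.name().replace("_", " "))
    return sorted(ants)

def dual_queries(query_terms):
    """generate every query obtained by keeping each term or replacing it
    with one of its antonyms; following the running example,
    ['capitalism', 'war'] should also yield 'socialism war',
    'capitalism peace' and 'socialism peace'."""
    options = []
    for t in query_terms:
        alts = antonyms(t)
        options.append([t] + (alts[:1] if alts else []))  # one antonym for simplicity
    return [" ".join(combo) for combo in product(*options)]

print(dual_queries(["capitalism", "war"]))
```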
we have gradable antonyms (e.g., hot cold) in cases where the definitions of the words lie on a continuous spectrum. we have relational antonyms (e.g., teacher student) in cases where the two meanings are opposite only within the context of their relationship. the rest are called complementary antonyms (e.g., day night). in general, the selection of the "right" antonyms raises various questions. in many cases more than one antonyms exist, so one should decide which one(s) to select. sometimes this can depend on the context, e.g., the antonym of "action" is "apathy", but in terms of physics or sociology the dual of "action" is "reaction". notice that the proposed approach can be exploited in any context where the aim is to retrieve semantically opposing entities, information, etc. as an example consider the argument web [2] , where the approach could be used for retrieving contradicting arguments and providing support for each one of them. from a system's perspective, the approach can be realized in various levels and settings. in the setting of an ir system, it can be implemented by changing accordingly the query processor and the ranking module, while in a meta-search setting, by changing the query rewriting, the query forwarding and the ranking components. it could also be exploited in the query autocompletion layer. to start with, we have conducted a preliminary evaluation. we have specified 15 information tasks which are shown in table 1 , that can exploit the proposed approach. the tasks are of exploratory nature and were created using the task refinement steps described in [10] . we have identified the following types of tasks: explore domain (ed), medical treatment (mt), explore product reviews (epr) and person qualities (pq). for each task we provide a description of the information need, a representative query and the relevant antonyms, which were manually selected from the list of the respective wordnet antonyms. we conducted our experiment over 9 female and 25 male users of various ages. for each task, they were given two lists of results. one contained the results of the query from a popular wse, and the other one was constructed by interleaving the results of the same wse for the dual queries of this task (i.e., first the top result of the original query, then the first result of its dual, etc.). the two kinds of lists were given in random order for each task. the users were asked to select the most preferred list and to provide a grade of preference taking values in {1, 2, 3, 4, 5}, where 5 means that the selected list was preferred much more than the other one. in the background, when users prefer the results of the dual approach, we change the sign of the score and make it negative. the users were not aware how the lists were constructed and were not guided in any way by the evaluator. in fig. 3 we provide two graphs that describe the results of the evaluation. figure 3 (a), shows the aggregated scores given by all users to each query, while fig. 3 (b) shows the aggregated scores given by each participant to all queries. regarding the first one the results are not the expected ones, although we hypothesize that the users mainly penalized the dual approach because of the 'irrelevant' results to the original query in terms of query tokens and not in terms of relevant information. for eleven of the queries there is a strong preference towards the non-dual approach. 
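the interleaved list used in the evaluation can be produced as sketched below; skipping duplicate documents is an assumption of this sketch, not something stated in the text.

```python
def interleave(original_results, dual_results, k=10):
    """alternate results of the original query and its dual query, starting
    with the original, and skip documents already shown (an assumption)."""
    merged, seen = [], set()
    for a, b in zip(original_results, dual_results):
        for doc in (a, b):
            if doc not in seen:
                merged.append(doc)
                seen.add(doc)
    return merged[:k]

print(interleave(["d1", "d2", "d3"], ["e1", "d2", "e3"], k=5))
# -> ['d1', 'e1', 'd2', 'd3', 'e3']
```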
the epr type of queries belong to this category, showing that users are probably not interested for reviews with the opposite direction of what they are looking for. this is especially true for q 12 , where the dual approach provided results about winter vacations and was the least preferred. for two of the tasks, the approaches are almost incomparable. both of these tasks belong to the mt group. there are also two queries, q 3 and q 15 , where the dual approach is better, especially in the last one. in their comments for these queries, users mention that the selected (i.e., dual) list "provides a more general picture" and "more relevant and interesting results, although contradicting". regarding the second graph we have the interesting result that the proposed approach appeals to specific users. it seems that nine users (26% of the participants) have an exploratory nature and generally prefer the dual approach (six of them strongly), while for four of them the two approaches are incomparable. the rest are better served with the non-dual approach. this is an interesting outcome, and in the future we plan to identify the types of users that prefer the dual approach. we have motivated with examples why it is worth investigating dualism for nonmonosemous terms in the context of exploratory search and we have shown its importance at least for some types of users and tasks. for the future, we plan to define the appropriate antonyms selection algorithms and relevance metrics, implement the proposed functionality in a meta-search setting, and conduct a large scale evaluation with real users over exploratory tasks, to identify in which queries the dual approach is beneficial and to what types of users. query expansion techniques for information retrieval: a survey implementing the argument web evaluating subtopic retrieval methods: clustering versus diversification of search results multidimensional relevance: a new aggregation criterion supporting exploratory search multidimensional relevance: prioritized aggregation in a personalized information retrieval setting on antonym and negate in fuzzy logic improving word embeddings for antonym detection using thesauri and sentiwordnet negotiating a multidimensional framework for relevance space creating exploratory tasks for a faceted search interface word embedding-based antonym detection using thesauri and distributional information understanding user goals in web search relevance: a review of the literature and a framework for thinking on the notion in information science. part ii: nature and manifestations of relevance key: cord-020890-aw465igx authors: brochier, robin; guille, adrien; velcin, julien title: inductive document network embedding with topic-word attention date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_22 sha: doc_id: 20890 cord_uid: aw465igx document network embedding aims at learning representations for a structured text corpus i.e. when documents are linked to each other. recent algorithms extend network embedding approaches by incorporating the text content associated with the nodes in their formulations. in most cases, it is hard to interpret the learned representations. moreover, little importance is given to the generalization to new documents that are not observed within the network. in this paper, we propose an interpretable and inductive document network embedding method. 
we introduce a novel mechanism, the topic-word attention (twa), that generates document representations based on the interplay between word and topic representations. we train these word and topic vectors through our general model, inductive document network embedding (idne), by leveraging the connections in the document network. quantitative evaluations show that our approach achieves state-of-the-art performance on various networks and we qualitatively show that our model produces meaningful and interpretable representations of the words, topics and documents. document networks, e.g. social media, question-and-answer websites, the scientific literature, are ubiquitous. because these networks keep growing larger and larger, navigating efficiently through them becomes increasingly difficult. modern information retrieval systems rely on machine learning algorithms to support users. the performance of these systems heavily depends on the quality of the document representations. learning good features for documents is still challenging, in particular when they are structured in a network. recent methods learn the representations in an unsupervised manner by combining structural and textual information. text-associated deepwalk (tadw) [28] incorporates text features into the low-rank factorization of a matrix describing the network. graph2gauss [2] learns a deep encoder, guided by the network, that maps the nodes' attributes to embeddings. gvnr-t [3] factorizes a random walk based matrix of node co-occurrences and integrates word vectors of the documents in its formulation. cane [25] introduces a mutual attention mechanism that builds representations of a document contextually to each of its direct neighbors in the network. apart from graph2gauss, these methods are not intended to generate representations for documents with no connection to other documents and thus cannot induce a posteriori representations for new documents. moreover, they provide little to no possibility to interpret the learned representations. cane is a notable exception since its attention mechanism produces interpretable weights that highlight the words explaining the links between documents. nevertheless, it lacks the ability to explain the representations for each document independently. in this paper, we describe and evaluate an inductive and interpretable method that learns word, topic and document representations in a single vector space, based on a new attention mechanism. our contributions are the following: -we present a novel attention mechanism, topic-word attention (twa), that produces representations of a text where latent topic vectors attend to the word vectors of a document; -we explain how to train the parameters of twa by leveraging the links of the network. our method, inductive document network embedding (idne), is able to produce representations for previously unseen documents, without network information; -we quantitatively assess the performance of idne on several networks and show that our method performs better than recent methods in various settings, including when new documents, not part of the network, are inductively represented by the algorithms. to our knowledge, we are the first to evaluate this kind of inductive setting in the context of document network embedding; -we qualitatively show that our model learns meaningful word and topic vectors and produces interpretable document representations. the rest of the paper is organized as follows. in sect. 2 we survey related works. 
we present in details our attention mechanism and show how to train it on networks of documents in sect. 3. next, in sect. 4, we present a thorough experimental study, where we assess the performance of our model following the usual evaluation protocol on node classification and further evaluating its capacity of inducting representations for text documents with no connection to the network. in sect. 5, we study the ability of our method to provide interpretable representations. lastly, we conclude this paper and provide future directions in sect. 6. the code for our model, the datasets and the evaluation procedure are made publicly available 1 . network embedding (ne) provides an efficient approach to represent nodes in a low dimensional vector space, suitable for solving various machine learning tasks. recent techniques extend ne for document networks, showing that text and graph information can be combined to improve the resolution of classification and prediction tasks. in this section, we first cover important works in document ne and then relate recent advances in attention mechanisms. deepwalk [22] and node2vec [9] are the most well-known ne algorithms. they train dense embedding vectors by predicting nodes co-occurrences through random walks by adapting the skip-gram model initially designed for word embedding [19] . verse [24] propose an efficient algorithm that can handle any type of similarity over the nodes. text-associated deepwalk (tadw) [28] extends deepwalk to deal with textual attributes. yang et al. prove, following the work in [17] , that skip-gram with hierarchical softmax can be equivalently formulated as a matrix factorization problem. tadw then consists in constraining the factorization problem with a pre-computed representation of the documents t by using latent semantic analysis (lsa) [6] . the task is to optimize the objective: where m = (a + a 2 )/2 is a normalized second-order adjacency matrix of the network, w is a matrix of one-hot node embeddings and h a feature transformation matrix. final document embeddings are the concatenation of w and ht . graph2gauss (g2g) [2] is an approach that embeds each node as a gaussian distribution instead of a vector. the algorithm is trained by passing node attributes through a non-linear transformation via a deep neural network (encoder). gvnr-t [3] is a matrix factorization approach for document network embedding, inspired by glove [21] , that simultaneously learns word, node and document representations. in practice, the following least-square objective is optimized: where x ij is the number of co-occurrences of nodes i and j, u i is a one-hot encoding of node i and δj w |δj |1 is the average of the word embeddings of document j. context-aware network embedding (cane) [25] consists in a mutual attention mechanism trained on a document network. it learns several embeddings for a document according to its different contextual documents, represented by its neighbors in the network. the attention mechanism selects meaningful features from text information in pairs of documents that explain their relatedness in the graph. a similar approach is presented in [4] where the links between pairs of documents are predicted by computing the mutual contribution of their word embeddings. in this work, we aim at constructing representations of documents that reflect their connections in a network. a key motivation behind our approach is to be able to predict a document's neighborhood given only its textual content. 
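the objective referenced as eq. (1) is missing from the extracted text; the latex block below gives a plausible reconstruction of the tadw factorization in the notation of this paragraph, based on the cited work [28]. it is a reconstruction, not the verbatim equation, and the gvnr-t least-squares objective (eq. (2)) is only described in prose above, not reconstructed here.

```latex
% tadw (eq. 1-style): low-rank factorization of M constrained by the
% lsa text features T, with W the one-hot node embeddings and H the
% feature transformation matrix
\min_{W,H}\;\bigl\lVert M - W^{\top} H\,T \bigr\rVert_F^2
 + \frac{\lambda}{2}\Bigl(\lVert W\rVert_F^2 + \lVert H\rVert_F^2\Bigr)
```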
this allows our model to inductively produce embeddings for new documents for which no existing link is known. to that extend, graph2gauss is a similar approach. on the contrary, tadw and gvnr-t are not primarily designed for this purpose as they both learn one-hot embeddings for each node in the document network. note that if some methods like graphsage [10] , sdne [27] and gae [13] also enable induction on new nodes, they cannot deal with nodes that have no known connection. also, our approach differs from cane since this latter needs the neighbors of a document to generate its representation. idne learns to produce a single interpretable vector for each document in the network. in the next section, we review recent works in attention mechanisms for natural language processing (nlp) that inspired the conception of our method. an attention mechanism uses a contextual representation to highlight or hide some parts of input data. attention is an essential element of state-of-the-art neural machine translation (nmt) algorithms [18] by providing a powerful way to capture dependencies between words. the transformer [26] introduces a formalism of attention mechanisms for nmt. given a query vector q, a set of key vectors k and a set of value vectors v , an attention vector is produced with the following formula: qk t measures the similarity between the query and each key k of k. ω is a normalization function such that all attention weights are positive and sum to 1. v a is then the weighted sum of the values v according to the attention weights. multiple attention vectors can be generated by using a set of queries q. in cane, as for various nlp tasks [7] , an attention mechanism generates attention weights that represent the strengths of relation between pairs of input words. however, in this paper, we do not seek to learn dependencies between pairs of words, but rather between words and some global topics. in this direction, the set transformer [16] constitutes a computationally efficient attention mechanism where the queries are replaced with a fixed-size set of learnable global inducing points. this model is originally not intended for nlp tasks, therefore we will explore the capacity of such inducing points to play the role of topic representations when applied to textual data. even if we introduce the concept of topic vectors, the aim of this work is not to propose another topic model [5, 23] . we hypothesize that the introduction of global topic vectors in an attention mechanism can (1) lead to useful representations of documents for different tasks and (2) bring an interpretable sight on the patterns learned by the model. interpretability can help both machine learning practitioners to better refine their models and end users to understand automated recommendations. we are interested in finding low dimensional vector space representations of a set of n d documents organized in a network, described by a document-term matrix x ∈ n n d ×nw and an adjacency matrix a ∈ n n d ×n d , where n w stands for the number of words in our vocabulary. the method we propose, inductive document network embedding (idne), learns to represent the words and topics underlying the corpus in a single vector space. the document representations are computed by combining words and topics through an attention mechanism. in the following, we first describe how to derive the document vectors from known word and topic vectors through a novel attention mechanism, the topic-word attention (twa). 
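the generic attention formula referenced as eq. (3) follows directly from the surrounding description (ω normalizes the query-key similarities, and the result weights the values); the second line shows the transformer's scaled dot-product instantiation for comparison.

```latex
% generic attention as described in the text (eq. 3-style)
V_a = \omega\!\bigl(QK^{\top}\bigr)\,V

% the transformer's scaled dot-product instantiation, where \omega is a
% row-wise softmax and d_k is the key dimension
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\Bigl(\tfrac{QK^{\top}}{\sqrt{d_k}}\Bigr)V
```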
next, we show how to estimate the word and topic vectors, guided by the links connecting the documents of the network. we assume a p-dimensional vector space in which both words and topics are represented. we note w ∈ r nw×p the matrix that contain the n w word embedding vectors and t ∈ r nt×p the matrix of n t topic vectors. figure 1 shows the matrix computation of the attention weights. topic-word attention. given a document i and its bag-of-word encoding x i ∈ n + nw , we measure the attention weights between topics and words, z i ∈ r nt×nw , as follows: the activation function g must satisfy two requirements: (1) all the weights are non-negative and (2) columns of z i sum to one. the intuition behind the first requirement is that enforcing non-negativity should lead to sparse and interpretable topics. the second requirement transforms the raw weights into wordwise relative attention weights, which can be read as probabilities similarly to what is done in neural topic models [23] . an obvious choice would be columnwise softmax, however, we empirically find that relu followed by a column-wise normalization performs best. document representation. given z i , we are able to calculate topic-specific representations of the document i. from the perspective of topic k, the p-dimensional representation of document i is: similarly to eq. 3, each topic vector, akin to a query, attends to the word vectors that play the role of keys to generate z i . the topic-specific representations are then the weighted sum of the values, also played by the word vectors. the final document vector is obtained by simple summation of all the topic-specific representations, which leads to d i = k d i k . scaling by 1 |xi|1 in eq. 5 ensures that the document vectors have the same order of magnitude as the word vectors. since the corpus is organized in a network, we propose to estimate the parameters, w and t , by leveraging the links between the documents. we posit that the representations of documents connected by a short path in the network should be more similar in the vector space than those that are far apart. thus, we learn w and t in a supervised manner, through the training of a discriminative model. let δ ∈ {0, 1} n d ×n d be a binary matrix, so that δ ij = 1 if document j is reachable from document i and δ ij = 0 otherwise. we model the probability of a pair of documents to be connected, given their representations, in terms of the sigmoid of the dot-product of d i and d j : assuming the document representations are i.i.d, we can express the loglikelihood of δ given w and t : through the maximization of this log-likelihood via a first-order optimization technique, we back-propagate the gradient and thus learn the word and topic vectors that lead to the document representations that best reconstruct δ. common tasks in document network embedding are classification and link prediction. we assess the quality of the representations learned with idne for these tasks in two different settings: (1) a traditional setting where all links and documents are observed and (2) an inductive setting where only a fraction of the links and documents is observed during training. the first setting corresponds to a scenario where the goal is to propagate labels associated with a small portion of the documents. the second represents a scenario where we want to predict labels and links for new documents that have no network information, once the algorithm is already trained. this is common setting in real world applications. 
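the displayed equations (4)-(7) are missing from the extracted text; the numpy sketch below follows the written description (relu plus column-wise normalization, count-weighted sums of word vectors scaled by 1/|x_i|_1, a sigmoid link probability and a bernoulli log-likelihood). the exact masking and normalization details of the published model may differ.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def twa_document_vector(x_i, W, T, eps=1e-9):
    """topic-word attention for one document, as described in the text.
    x_i: (n_w,) bag-of-words counts, W: (n_w, p) word vectors, T: (n_t, p)."""
    scores = relu(T @ W.T)                  # (n_t, n_w) topic-word affinities
    scores = scores * (x_i > 0)             # keep only words present in document i
    Z = scores / (scores.sum(axis=0, keepdims=True) + eps)  # column-wise normalization
    # topic-specific vectors: attention- and count-weighted sums of word vectors
    D = (Z * x_i) @ W / max(x_i.sum(), 1.0)  # (n_t, p)
    return D.sum(axis=0)                     # final document vector d_i

def link_log_likelihood(D, delta):
    """bernoulli log-likelihood of the reachability matrix delta given the
    document vectors D (n_d, p), with p(link) = sigmoid(d_i . d_j)."""
    probs = sigmoid(D @ D.T)
    return np.sum(delta * np.log(probs + 1e-9)
                  + (1 - delta) * np.log(1 - probs + 1e-9))
```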
as an example, when a new user asks a new question on a q&a website, we would like to suggest tags for its question and to recommend potential similar questions. in this case, the only information available to the algorithm is the textual content of the question. we detail here the setup we use to train idne. computing the δ matrix. we consider paths of length up to 2 and compute the δ matrix in the following manner: this means that two documents are considered close in the network if they are direct neighbors or share at least one neighbor. note that this matrix is the binarized version of the matrix tadw factorizes. optimizing the log-likelihood. we perform mini-batch sgd with the adam [12] update rule. because most document networks are sparse, rather than uniformly sampling entries of δ, we sample 5000 balanced mini-batches in order to favor convergence. we sample 16 we consider 4 networks of documents of various nature: -a well-known scientific citation network extracted from cora 2 . each document is an article labelled with a conference. -new york times (nyt) titles of articles from january 2007. articles are linked according to common tags (e.g. business, arts, technology) and are labeled with the section they appear in (e.g. opinion, news). this network is particularly dense and documents have a short length. -two networks of the q&a website stack exchange (se) 3 from june 2019, namely gaming.stackexchange.com and travel.stackexchange.com. we only keep questions with at least 10 user votes and that have at least one answer with 10 user votes or more. we build the network by linking questions with their answers and by linking questions and answers of the same user. the labels are the tags associated with each question (table 1) . for each network, we consider a traditional classification tasks, an inductive classification task and an inductive link prediction task. -the traditional task refers to a setting where the model is trained on the entire network and the learned representations are used as features for a one-vs-all linear classifier with a training set of labelled documents ranging from 2% to 10% for multi-class networks and from 10% to 50% for multi-label networks. -the inductive tasks refer to a setting where 10% of the documents are removed from the network and the model is trained on the resulting sub-network. for the classification task, a linear classifier is trained with the representations and the labels of the observed documents. representations for hidden documents are then generated in an inductive manner, using their textual content only. classifications and link predictions are then performed on these induced representations. to classify the learned representations, we use the liblinear [8] logistic regression [14] algorithm and we cross validate the regularization parameter for each dataset and each model. every experiment is repeated 10 times and we report the micro average of the area under the roc curve (auc). the auc uses the probabilities of the logistic regression for all classes and evaluates the quality of the resulting ranking given the true labels. this metric is thus suitable for information retrieval tasks where we want to penalize wrong predictions depending on their ranks. for link prediction, we rank pairs of documents according to the cosine similarity between their representations. 
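the formula for the reachability matrix is missing from the extracted text, but the description pins it down as a binarized two-hop matrix; a minimal sketch:

```python
import numpy as np

def reachability_matrix(A):
    """delta_ij = 1 if j is a direct neighbor of i or shares at least one
    neighbor with i, i.e. a binarized version of A + A^2."""
    A = (A > 0).astype(int)
    delta = ((A + A @ A) > 0).astype(int)
    np.fill_diagonal(delta, 0)   # assumption: self-pairs are not sampled
    return delta
```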
for all document networks, we process the documents by tokenizing text into words, discarding punctuation, stop words and words that appear fewer than 5 times or in more than 25% of the documents. we create document-term matrices that are used as input for 6 algorithms. our baselines are representative of the different approaches for document ne. tadw and gvnr-t are based on matrix factorization whereas cane and g2g are deep learning models. for each of them, we used the implementations of the authors: -lsa: we use a 256-dimensional svd decomposition of the tf-idf vectors as a text-only baseline; -tadw: we follow the guidelines of the original paper by using 20 iterations and a penalty term λ = 0.2. for induction, we generate a document vector by computing the textual component $h_t$ in eq. 1; -graph2gauss (g2g): we make sure the loss function converges before the maximum number of iterations; -gvnr-t: we use γ = 10 random walks of length t = 40, a sliding window of size l = 5 and a threshold $x_{min} = 5$ with 1 iteration. for induction, we compute $\delta_j W / |\delta_j|_1$ in eq. 2; -cane: we use the same parameters as in the original paper; -idne: we run all experiments with $n_t = 32$ topic vectors. the effect of $n_t$ is discussed in sect. 4.6. tables 2 and 3 detail the auc scores on the traditional classification task. we report the results for cane only for cora since the algorithm did not terminate within 10 h for the other networks. in comparison, our method takes about 5 min to run on each network on a regular laptop. the classifier performs well on the representations we learned, achieving similar or better results than the baseline algorithms on cora, gaming and travel stack exchange. however, regarding the new york times network, gvnr-t and tadw have a slight advantage. because of its high density, the links in this network carry little information, which may explain the relatively good scores of the lsa representations. we hypothesize that (1) tadw benefits from its input lsa features and that (2) gvnr-t benefits both from its random walk based matrix of node co-occurrences [20], which captures more precisely the proximities of the nodes in such a dense network, and from the short length of the documents, making the word embedding averaging efficient [1, 15]. table 4 shows the auc scores in the inductive settings. for link prediction, idne performs best on three networks, showing its capacity to learn meaningful word and topic representations according to the network structure. for classification, lsa and gvnr-t achieve the best results while idne reaches similar but slightly lower scores on all datasets. on the contrary, tadw and graph2gauss show weaknesses on nyt and gaming se. in summary, idne shows consistent performance across all settings, whereas other methods lack robustness against the type of network or the type of task. a surprising result is the good scores of gvnr-t for inductive classification, which we did not expect given that only its textual component is used for this setting. however, for the traditional classification, gvnr-t has difficulty handling networks with longer documents. idne does not suffer the same problem because twa carefully selects discriminative words before averaging them. in sect. 5, we further show that idne learns meaningful representations of words and topics and builds interpretable document representations.
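as an illustration of the inductive link-prediction protocol above, the following sketch ranks held-out document pairs by the cosine similarity of their representations and scores the ranking with the auc; the representations and pairs are random placeholders, not data from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def link_prediction_auc(doc_vectors, pairs, labels):
    """rank candidate document pairs by cosine similarity and evaluate with ROC AUC."""
    normed = doc_vectors / (np.linalg.norm(doc_vectors, axis=1, keepdims=True) + 1e-12)
    scores = np.sum(normed[pairs[:, 0]] * normed[pairs[:, 1]], axis=1)
    return roc_auc_score(labels, scores)

# placeholder representations, held-out pairs and ground-truth links
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 64))
pairs = rng.integers(0, 100, size=(500, 2))
labels = rng.integers(0, 2, size=500)
print(round(link_prediction_auc(docs, pairs, labels), 3))
```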
figure 2 shows the impact of the number of topic vectors $n_t$ and of the number of steps (mini-batches) on the auc scores obtained in traditional classification with cora. note that we observe a similar behavior on the other networks. we see that the scores improve from 1 to 16 topics and tend to stagnate for larger values. in a similar manner, performance improves up to 5000 iterations, after which no increase is observed. we first show in sect. 5.1 that idne is capable of learning meaningful word and topic vectors. then, we provide visualizations of documents that highlight the ability of the topic-word attention to reveal topics of interest. for all experiments, we set the number of topics to $n_t = 6$. table 5 shows the closest words to each topic, computed as the dot product between their respective vectors, learned on cora. word and topic vectors are trained to predict the proximity of the nodes in a network; meaningless words are thus always dissimilar to the topic vectors, since they do not help to predict a link. this can be verified by observing the words that have the largest and the smallest norms, also reported in table 5. even though the topics are learned in an unsupervised manner, we notice that, when we set the number of topics close to the number of classes, each topic seems to capture the semantics of one particular class. to further highlight the ability of our model to bring interpretability, we show in fig. 3 the topics that most likely generated the words of a document according to twa. the document is the abstract of this paper, whose weights are inductively calculated with idne previously trained on cora. we compute its attention weights $Z_i$ and associate each word k with the maximum value of its column of $Z_i$. we then colorize and underline each word associated with the two most represented topics in the document, if its weight is higher than 1/2. we see that the major topic (green and single underline), which accounts for 32% of the weights, deals with the type of data, here document networks. the second topic (blue and double underline), which represents 18% of the weights, relates to text modeling, with words like "interpretable" and "topics". in this paper, we presented idne, an inductive document network embedding algorithm that learns word and latent topic representations via twa, a topic-word attention mechanism able to produce interpretable document representations. we showed that idne achieves state-of-the-art results on various networks in different settings. moreover, we showed that our attention mechanism provides an efficient way of interpreting the learned representations. in future work, we would like to study the effect of the sampling of the documents on the learned topics. in particular, the matrix $\delta$ could capture other types of similarities between documents such as simrank [11], which measures structural relatedness between nodes instead of proximities. this could reveal complementary topics underlying a document network and could provide interpretable explanations of the roles played by documents in networks.
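the qualitative analysis above can be mimicked with a short sketch: the closest words to each topic are the largest dot products between word and topic vectors, and each word of a new document is assigned to the topic with the maximum attention weight in its column, if that weight exceeds 1/2. the attention matrix Z is assumed to be computed as in the earlier sketch; vocabulary and vectors below are toy placeholders.

```python
import numpy as np

def top_words_per_topic(T, W, vocab, k=5):
    """closest words to each topic, by dot product between topic and word vectors."""
    sims = T @ W.T                                       # (n_t, n_w)
    return [[vocab[j] for j in np.argsort(-sims[t])[:k]] for t in range(T.shape[0])]

def dominant_topic_per_word(Z, x, vocab, threshold=0.5):
    """assign each present word to its max-attention topic when the weight is above 1/2."""
    out = {}
    for j in np.flatnonzero(x):
        t = int(np.argmax(Z[:, j]))
        if Z[t, j] > threshold:
            out[vocab[j]] = t
    return out

# toy usage with random vectors and a fake vocabulary
rng = np.random.default_rng(0)
T, W = rng.normal(size=(6, 16)), rng.normal(size=(50, 16))
vocab = [f"w{i}" for i in range(50)]
print(top_words_per_topic(T, W, vocab, k=3)[0])
```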
a simple but tough-to-beat baseline for sentence embeddings deep gaussian embedding of graphs: unsupervised inductive learning via ranking global vectors for node representations link prediction with mutual attention for textattributed networks relational topic models for document networks indexing by latent semantic analysis bert: pre-training of deep bidirectional transformers for language understanding liblinear: a library for large linear classification node2vec: scalable feature learning for networks inductive representation learning on large graphs simrank: a measure of structural-context similarity adam: a method for stochastic optimization variational graph auto-encoders logistic regression. statistics for biology and health distributed representations of sentences and documents set transformer neural word embedding as implicit matrix factorization effective approaches to attention-based neural machine translation distributed representations of words and phrases and their compositionality the pagerank citation ranking: bringing order to the web glove: global vectors for word representation deepwalk: online learning of social representations autoencoding variational inference for topic models verse: versatile graph embeddings from similarity measures cane: context-aware network embedding for relation modeling attention is all you need structural deep network embedding network representation learning with rich text information key: cord-020815-j9eboa94 authors: kamphuis, chris; de vries, arjen p.; boytsov, leonid; lin, jimmy title: which bm25 do you mean? a large-scale reproducibility study of scoring variants date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_4 sha: doc_id: 20815 cord_uid: j9eboa94 when researchers speak of bm25, it is not entirely clear which variant they mean, since many tweaks to robertson et al.’s original formulation have been proposed. when practitioners speak of bm25, they most likely refer to the implementation in the lucene open-source search library. does this ambiguity “matter”? we attempt to answer this question with a large-scale reproducibility study of bm25, considering eight variants. experiments on three newswire collections show that there are no significant effectiveness differences between them, including lucene’s often maligned approximation of document length. as an added benefit, our empirical approach takes advantage of databases for rapid ir prototyping, which validates both the feasibility and methodological advantages claimed in previous work. bm25 [8] is perhaps the most well-known scoring function for "bag of words" document retrieval. it is derived from the binary independence relevance model to include within-document term frequency information and document length normalization in the probabilistic framework for ir [7] . although learning-to-rank approaches and neural ranking models are widely used today, they are typically deployed as part of a multi-stage reranking architecture, over candidate documents supplied by a simple term-matching method using traditional inverted indexes [1] . often, this is accomplished using bm25, and thus this decades-old scoring function remains a critical component of search applications today. as many researchers have previously observed, e.g., trotman et al. [11] , the referent of bm25 is quite ambiguous. there are, in fact, many variants of the scoring function: beyond the original version proposed by robertson et al. 
[8], many variants exist that include small tweaks by subsequent researchers. also, researchers using different ir systems report (sometimes quite) different effectiveness measurements for their implementation of bm25, even on the same test collections; consider for example the results reported in osirrc 2019, the open-source ir replicability challenge at sigir 2019 [2]. furthermore, bm25 is parameterized in terms of $k_1$ and b (plus $k_2$, $k_3$ in the original formulation), and researchers often neglect to include the parameter settings in their papers. our goal is a large-scale reproducibility study to explore the nuances of different variants of bm25 and their impact on retrieval effectiveness. we include in our study the specifics of the implementation of bm25 in the lucene open-source search library, a widely-deployed variant "in the real world". outside of a small number of commercial search engine companies, lucene -either stand-alone or via higher-level platforms such as solr and elasticsearch- has today become the de facto foundation for building search applications in industry. our approach enlists the aid of relational databases for rapid prototyping, an idea that goes back to the 1990s and was more recently revived by mühleisen et al. [6]. adding or revising scoring functions in any search engine requires custom code within some framework for postings traversal, making the exploration of many different scoring functions (as in our study) a tedious and error-prone process. as an alternative, it is possible to "export" the inverted index to a relational database and recast the document ranking problem into a database (specifically, sql) query. varying the scoring function, then, corresponds to varying the expression for calculating the score in the sql query, allowing us to explore different bm25 variants by expressing them declaratively (instead of programming imperatively). we view our work as having two contributions: -we conducted a large-scale reproducibility study of bm25 variants, focusing on the lucene implementation and variants described by trotman et al. [11]. their findings are confirmed: effectiveness differences in ir experiments are unlikely to be the result of the choice of bm25 variant a system implemented. -from the methodological perspective, our work can be viewed as reproducing and validating the work of mühleisen et al. [6], the most recent advocate of using databases for rapid ir prototyping. lucene (default) differs from the original formulation of robertson et al. in two ways. first, because the idf component of robertson et al. is negative when $df_t > N/2$, lucene adds a constant one before calculating the log value. second, the document length used in the scoring function is compressed (in a lossy manner) to a one-byte value, denoted $l_{d,lossy}$. with only 256 distinct document lengths, lucene can pre-compute the value of $k_1 \cdot (1 - b + b \cdot (l_{d,lossy}/l_{avg}))$ for each possible length, resulting in fewer computations at query time. lucene (accurate) represents our attempt to measure the impact of lucene's lossy document length encoding. we implemented a variant that uses exact document lengths, but is otherwise identical to the lucene default. atire [10] implements the idf component of bm25 as $\log(N/df_t)$, which also avoids negative values. the tf component is multiplied by $k_1 + 1$ to make it look more like the classic rsj weight; this has no effect on the resulting ranked list, as all scores are scaled linearly with this factor.
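a minimal sketch of the points above: the robertson et al. idf versus the lucene idf, and the effect of scoring with an exact versus a quantized document length. the quantization function below is only a placeholder (lucene's actual one-byte encoding is not described in the text), and the tf normalization follows the pre-computed expression quoted above.

```python
import math

def idf_robertson(N, df):
    """original RSJ idf; negative when df > N/2."""
    return math.log((N - df + 0.5) / (df + 0.5))

def idf_lucene(N, df):
    """lucene adds a constant one inside the log, keeping the value non-negative."""
    return math.log(1 + (N - df + 0.5) / (df + 0.5))

def quantized_length(doc_len):
    """placeholder for lucene's lossy one-byte length encoding (256 distinct values);
    the real scheme is not specified in the text above."""
    return min(255, round(math.sqrt(doc_len))) ** 2

def bm25_term_score(tf, doc_len, avg_len, N, df, k1=0.9, b=0.4, lossy=True):
    length = quantized_length(doc_len) if lossy else doc_len
    norm = k1 * (1 - b + b * length / avg_len)   # the part lucene can pre-compute per length
    return idf_lucene(N, df) * tf / (tf + norm)

# same posting scored with exact vs. lossy document lengths
print(idf_robertson(1000, 700), idf_lucene(1000, 700))        # negative vs. non-negative idf
print(bm25_term_score(3, 870, 500, 10_000, 120, lossy=False))
print(bm25_term_score(3, 870, 500, 10_000, 120, lossy=True))
```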
bm25l [5] builds on the observation that bm25 penalizes longer documents too much compared to shorter ones. the idf component differs, to avoid negative values. the tf component is reformulated in terms of a quantity $c_{td}$; the $c_{td}$ component is further modified by adding a constant δ to it, boosting the score for longer documents. the authors report using δ = 0.5 for highest effectiveness. bm25+ [4] encodes a general approach for dealing with the issue that ranking functions unfairly prefer shorter documents over longer ones. the proposal is to add a lower-bound bonus when a term appears at least one time in a document. the difference with bm25l is that the constant δ is added to the tf component. the idf component is again changed to a variant that disallows negative values. bm25-adpt [3] is an approach that varies $k_1$ per term (i.e., it uses term-specific $k_1$ values). in order to determine the optimal value for $k_1$, the method starts by identifying the probability of a term occurring at least once in a document as $(df_r + 0.5)/(N + 1)$. the probability of the term occurring one more time is then defined as $(df_{r+1} + 0.5)/(df_r + 1)$. the information gain of a term occurring r + 1 instead of r times is defined as $G_q^r = \log_2((df_{r+1} + 0.5)/(df_r + 1)) - \log_2((df_t + 0.5)/(N + 1))$, where $df_r$ is defined as $|\{d \in D_t \mid c_{td} \ge r - 0.5\}|$ if $r > 1$, $df_t$ if $r = 1$, and $N$ if $r = 0$ ($c_{td}$ is the same as in bm25l). the information gain is calculated for $r \in \{0, \ldots, T\}$, until $G_q^r > G_q^{r+1}$. the optimal value for $k_1$ is then determined by finding the value of $k_1$ that minimizes the corresponding objective; essentially, this gives a value for $k_1$ that maximizes information gain for that specific term; $k_1$ and $G_q^1$ are then plugged into the bm25-adpt formula. we found that the optimal value of $k_1$ is actually not defined for about 90% of the terms. a unique optimal value for $k_1$ only exists when $r > 1$ while calculating $G_q^r$. for many terms, especially those with a low df, $G_q^r > G_q^{r+1}$ occurs before $r > 1$. in these cases, picking different values for $k_1$ has virtually no effect on retrieval effectiveness. for undefined values, we set $k_1$ to 0.001, the same as trotman et al. [11]. $\text{tf}_{l \circ \delta \circ p}{\times}\text{idf}$ [9] models the non-linear gain of a term occurring multiple times in a document as $1 + \log(1 + \log(tf_{td}))$. to ensure that terms occurring at least once in a document get boosted, the approach adds a fixed component δ, following bm25+. these parts are combined into the tf component using $tf_{td}/(1 - b + b \cdot (l_d/l_{avg}))$. the same idf component as in bm25+ is used. our experiments were conducted using anserini (v0.6.0) on java 11 to create an initial index, and subsequently using relational databases for rapid prototyping, which we dub "olddog" after mühleisen et al. [6]; following that work we use monetdb as well. evaluations with lucene (default) and lucene (accurate) were performed directly in anserini; the latter was based on previously-released code that we updated and incorporated into anserini. the inverted index was exported from lucene to olddog, ensuring that all experiments share exactly the same document processing pipeline (tokenization, stemming, stopword removal, etc.). while exporting the inverted index, we precalculate all $k_1$ values for bm25-adpt as suggested by lv and zhai [3]. as an additional verification step, we implemented both lucene (default) and lucene (accurate) in olddog and compared results to the output from anserini. we are able to confirm that the results are the same, setting aside unavoidable differences related to floating point precision.
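the bm25-adpt procedure is easier to follow in code. the sketch below computes the per-term information gains from the $c_{td}$ values; the minimized objective (fitting the tf saturation curve $(k_1+1)r/(k_1+r)$ to the gains normalized by $G_q^1$) is our assumption about the minimization the text refers to, and a coarse grid search stands in for the actual optimizer.

```python
import math

def df_r(r, ctd, df_t, N):
    """df_r as defined above: N for r=0, df_t for r=1, |{d : c_td >= r-0.5}| for r>1."""
    if r == 0:
        return N
    if r == 1:
        return df_t
    return sum(c >= r - 0.5 for c in ctd)

def info_gain(r, ctd, df_t, N):
    """G_q^r: gain of a term occurring r+1 instead of r times."""
    return (math.log2((df_r(r + 1, ctd, df_t, N) + 0.5) / (df_r(r, ctd, df_t, N) + 1))
            - math.log2((df_t + 0.5) / (N + 1)))

def term_specific_k1(ctd, N, fallback=0.001):
    """term-specific k1 for bm25-adpt (fitted objective is an assumption, see lead-in)."""
    df_t = len(ctd)
    gains = [info_gain(0, ctd, df_t, N)]
    r = 0
    while True:
        nxt = info_gain(r + 1, ctd, df_t, N)
        if nxt < gains[-1]:              # stop once G_q^r > G_q^{r+1}
            break
        gains.append(nxt)
        r += 1
    if r < 2 or gains[1] <= 0:           # optimum undefined (~90% of terms in the paper)
        return fallback
    def loss(k1):
        return sum((g / gains[1] - (k1 + 1) * i / (k1 + i)) ** 2 for i, g in enumerate(gains))
    return min((x / 100 for x in range(1, 301)), key=loss)

# toy usage: c_td values of one term over the documents that contain it
print(term_specific_k1([1.2, 0.8, 2.5, 1.0, 3.1, 0.9, 2.2, 1.7], N=1000))
```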
all bm25 variants are then implemented in olddog as minor variations upon the original sql query provided in mühleisen et al. [6] . the term-specific parameter optimization for the adpt variant was already calculated during the index extraction stage, allowing us to upload the optimal (t, k) pairs and directly use the term-specific k values in the sql query. the advantage of our experimental methodology is that we did not need to implement a single new ranking function from scratch. all the sql variants implemented for this paper can be found on github. 3 the experiments use three trec newswire test collections: trec disks 4 and 5, excluding congressional record, with topics and relevance judgments from the trec 2004 robust track (robust04); the new york times annotated corpus, with topics and relevance judgments from the trec 2017 common core track (core17); the trec washington post corpus, with topics and relevance judgments from the trec 2018 common core track (core18). following standard experimental practice, we assess ranked list output in terms of average precision (ap) and precision at rank 30 (p@30). the parameters shared by all models are set to k 1 = 0.9 and b = 0.4, anserini's defaults. the parameter δ is set to the value reported as best in the corresponding source publication. table 2 presents the effectiveness scores for the implemented retrieval functions on all three test collections. all experiments were run on a linux desktop (fedora 30, kernel 5.2.18, selinux enabled) with 4 cores (intel xeon cpu e3-1226 v3 @ 3.30 ghz) and 16 gb of main memory; the monetdb 11.33.11 server was compiled from source using the --enable-optimize flag. table 3 presents the average retrieval time per query in milliseconds (without standard deviation for anserini, which does not report time per query). monetdb uses all cores for both inter-and intraquery parallelism, while anserini is single-threaded. the observed differences in effectiveness are very small and can be fully attributed to variations in the scoring function; our methodology fixes all other parts of the indexing pipeline (tag cleanup, tokenization, stopwords, etc.). both an anova and tukey's hsd show no significant differences between any variant, on all test collections. this confirms the findings of trotman et al. [11] : across the ir literature, we find that differences due to more mundane settings (such as the choice of stopwords) are often larger than the differences we observe here. although we find no significant improvements over the original robertson et al. [8] formulation, it might still be worthwhile to use a variant of bm25 that avoids negative ranking scores. comparing lucene (default) and lucene (accurate), we find negligible differences in effectiveness. however, the differences in retrieval time are also negligible, which calls into question the motivation behind the original length approximation. currently, the similarity function and thus the document length encoding are defined at index time. storing exact document lengths would allow for different ranking functions to be swapped at query time more easily, as no information would be discarded at index time. accurate document lengths might additionally benefit downstream modules that depend on lucene. we therefore suggest that lucene might benefit from storing exact document lengths. in summary, this work describes a double reproducibility study-we methodologically validate the usefulness of databases for ir prototyping claimed by mühleisen et al. 
[6] and performed a large-scale study of bm25 to confirm the findings of trotman et al. [11] . returning to our original motivating question regarding the multitude of bm25 variants: "does it matter?", we conclude that the answer appears to be "no, it does not". effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures ceur workshop proceedings of the open-source ir replicability challenge (osirrc 2019) at sigir 2009 adaptive term frequency normalization for bm25 lower-bounding term frequency normalization when documents are very long old dogs are great at new tricks: column stores for ir prototyping the probabilistic relevance framework: bm25 and beyond okapi at trec-3 composition of tf normalizations: new insights on scoring functions for ad hoc ir towards an efficient and effective search engine improvements to bm25 and language models examined acknowledgements. this work is part of the research program commit2data with project number 628.011.001, which is (partly) financed by the nwo. additional support was provided by the natural sciences and engineering research council (nserc) of canada. key: cord-020888-ov2lzus4 authors: formal, thibault; clinchant, stéphane; renders, jean-michel; lee, sooyeol; cho, geun hee title: learning to rank images with cross-modal graph convolutions date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_39 sha: doc_id: 20888 cord_uid: ov2lzus4 we are interested in the problem of cross-modal retrieval for web image search, where the goal is to retrieve images relevant to a text query. while most of the current approaches for cross-modal retrieval revolve around learning how to represent text and images in a shared latent space, we take a different direction: we propose to generalize the cross-modal relevance feedback mechanism, a simple yet effective unsupervised method, that relies on standard information retrieval heuristics and the choice of a few hyper-parameters. we show that we can cast it as a supervised representation learning problem on graphs, using graph convolutions operating jointly over text and image features, namely cross-modal graph convolutions. the proposed architecture directly learns how to combine image and text features for the ranking task, while taking into account the context given by all the other elements in the set of images to be (re-)ranked. we validate our approach on two datasets: a public dataset from a mediaeval challenge, and a small sample of proprietary image search query logs, referred as webq. our experiments demonstrate that our model improves over standard baselines. this paper considers the typical image search scenario, where a user enters a text query, and the system returns a set of ranked images. more specifically, we are interested in re-ranking a subset of candidate images retrieved from the whole image collection by an efficient base ranker, following standard multi-stage ranking architectures in search engines [36] . directly including visual features in the ranking process is actually not straightforward due to the semantic gap between text and images: this is why the problem has initially been addressed using standard text-based retrieval, relying for instance on text crawled from the image's webpage (e.g. surrounding text, title of the page etc.). 
in order to exploit visual information, and therefore improve the quality of the results -especially because this text is generally noisy, and hardly describes the image semantic-, many techniques have been developed since. for instance, some works have focused on building similarity measures by fusing mono-modal similarities, using either simple combination rules, or more complex propagation mechanisms in similarity graphs. more recently, techniques have emerged from the computer vision community, where text and images are embedded in the same latent space (a.k.a. joint embedding), allowing to directly match text queries to images. the latter are currently considered as state-of-the-art techniques for the cross-modal retrieval task. however, they are generally evaluated on artificial retrieval scenarios (e.g. on mscoco dataset [34] ), and rarely considered in a re-ranking scenario, where mechanisms like pseudo-relevance feedback (prf) [31] are highly effective. we propose to revisit the problem of cross-modal retrieval in the context of re-ranking. our first contribution is to derive a general formulation of a differentiable architecture, drawing inspiration from cross-modal retrieval, learning to rank, neural information retrieval and graph neural networks. compared to joint embedding approaches, we tackle the problem in a different view: instead of learning new (joint) embeddings, we focus on designing a model that learns to combine information from different modalities. finally, we validate our approach on two datasets, using simple instances of our general formulation, and show that the approach is not only able to reproduce prf, but actually outperform it. cross-modal retrieval. in the literature, two main lines of work can be distinguished regarding cross-modal retrieval: the first one focuses on designing effective cross-modal similarity measures (e.g. [2, 10] ), while the second seeks to learn how to map images and text into a shared latent space (e.g. [15, 18, 19, 54] ). the first set of approaches simply combines different mono-media similarity signals, relying either on simple aggregation rules, or on unsupervised crossmodal prf mechanisms, that depend on the choice of a few but critical hyperparameters [2, 10, 11, 45] . as it will be discussed in the next section, the latter can be formulated as a two-step prf propagation process in a graph, where nodes represent multi-modal objects and edges encode their visual similarities. it has been later extended to more general propagation processes based on random walks [28] . alternatively, joint embedding techniques aim at learning a mapping between textual and visual representations [15, 18, 19, 23, [52] [53] [54] [55] 61] . canonical correlation analysis (cca) [17] and its deep variants [5, 27, 58] , as well as bi-directional ranking losses [8, 9, 52, 53, 55, 61] (or triplet losses) ensure that, in the new latent space, an image and its corresponding text are correlated or close enough w.r.t. to the other images and pieces of text in the training collection. other objective functions utilize metric learning losses [35] , machine translation-based measures [44] or even adversarial losses [51] . these approaches suffer from several limitations [61] : they are sensitive to the triplet sampling strategy as well as the choice of appropriate margins in the ranking losses. 
moreover, constituting a training set that ensures good learning and generalization is not an easy task: the text associated to an image should describe its visual content (e.g. "a man speaking in front of a camera in a park"), and nothing else (e.g. "the president of the us, the 10th of march", "john doe", "joy and happiness"). building a universal training collection of paired (image, text) instances, where text describes faithfully the content of the image in terms of elementary objects and their relationships, would be too expensive and time-consuming in practice. consequently, image search engines rely on such pairs crawled from the web, where the link between image and text (e.g. image caption, surrounding sentences etc.) is tenuous and noisy. to circumvent this problem, query logs could be used but, unfortunately -and this is our second argument regarding the limitations-, real queries are never expressed in the same way as the ones considered when evaluating joint embedding methods (e.g. artificial retrieval setting on mscoco [34] or flickr-30k [43] datasets, where the query is the full canonical textual description of the image). in practice, queries are characterised by very large intent gaps: they do not really describe the content of the image but, most of the time, contain only a few words, and are far from expressing the true visual needs. what does it mean to impose close representations for all images representing "paris" (e.g. "the eiffel tower", "louvre museum"), even if they can be associated to the same textual unit? neural information retrieval. neural networks, such as ranknet and lambdarank, have been intensively used in ir to address the learning to rank task [7] . more recently, there has been a growing interest in designing effective ir models with neural models [1, 12, 13, 20, 25, 26, 37, 38, 41, 56] , by learning the features useful for the ranking task directly from text. while standard strategies focus on learning a global ranking function that considers each query-document pair in isolation, they tend to ignore the difference in distribution in the feature space for different queries [4] . hence, some recent works have been focusing on designing models that exploit the context induced by the re-ranking paradigm, either by explicitly designing differentiable prf models [32, 40] , or by encoding the ranking context -the set of elements to re-rank-, using either rnns [4] or attention mechanisms [42, 62] . consequently, the score for a document takes into account all the other documents in the candidate list. because of their resemblance with structured problems, this type of approaches could benefit from the recent body of work around graph neural networks, which operate on graphs by learning how to propagate information to neighboring nodes. graph neural networks. graph neural networks (gnns) are extensions of neural networks that deal with structured data encoded as a graph. recently, graph convolutional networks (gcns) [30] have been proposed for semisupervised classification of nodes in a graph. each layer of a gcn can generally be decomposed as: (i) node features are first transformed (e.g. linear mapping), (ii) node features are convolved, meaning that for each node, a differentiable, permutation-invariant operation (e.g. sum, mean, or max) of its neighbouring node features is computed, before applying some non-linearity, (iii) finally, we obtain a new representation for each node in the graph, which is then fed to the next layer. 
many extensions of gcns have been proposed (e.g. graphsage [21], graph attention network [50], graph isomorphism network [57]), some of them directly tackling the recommendation task (e.g. pinsage [59]). but to the best of our knowledge, there is no prior work on using graph convolutions for the (re-)ranking task. our goal is to extend and generalize simple yet effective unsupervised approaches which have been proposed for the task [2, 3, 10, 11, 45], and which can be seen as an extension of pseudo-relevance feedback methods to multi-modal objects. let $d \in \mathcal{D}$ denote a document to re-rank, composed of text and image. we denote by $s_v(\cdot, \cdot)$ a normalized similarity measure between two images, and by $s_t(q, d)$ the textual relevance score of document d w.r.t. query q. the cross-modal similarity score is given by: where $NN_t^k(q)$ denotes the set of k most relevant documents w.r.t. q, based on text, i.e. on $s_t(q, \cdot)$. the model can be understood very simply: similarly to prf methods in standard information retrieval, the goal is to boost images that are visually similar to top images (from a text point of view), i.e. images that are likely to be relevant to the query but were initially badly ranked (which is likely to happen in the web scenario, where text is crawled from the source page and can be very noisy). despite showing good empirical results, cross-modal similarities are fully unsupervised, and lack some dynamic behaviour, like being able to adapt to different queries. moreover, they rely on a single relevance score $s_t(q, \cdot)$, while it could actually be beneficial to learn how to use a larger set of features such as the ones employed in learning to rank models. in [3], the authors made a parallel between the cross-modal similarity from eq. (1) and random walks in graphs: it can be seen as a kind of multimodal label propagation in a graph. this motivates us to tackle the task using graph convolutions. we therefore represent each query $q \in \mathcal{Q}$ as a graph $G_q$, as follows: -the set of nodes is the set of candidate documents $d_i$ to be re-ranked for this query: typically from a few to hundreds of documents, depending on the query. -each node i is described by a set of n learning to rank features $x_{q,d_i} \in \mathbb{R}^n$. $v_i \in \mathbb{R}^d$ denotes the (normalized) visual embedding for document $d_i$. -as we do not have an explicit graph structure, we consider edges given by a k-nearest neighbor graph, based on a similarity between the embeddings $v_i$. -we denote by $\mathcal{N}_i$ the neighborhood of node i, i.e. the set of nodes j such that there exists an edge from j to i. -we consider edge weights, given by a similarity function between the visual features of its two extremity nodes. our goal is to learn how to propagate features in the above graph. generalizing convolution operations to graphs can generally be expressed as a message passing scheme [16]: where γ and φ denote differentiable functions, e.g. mlps (multi-layer perceptrons). by choosing φ and the aggregation appropriately, this graph convolution can be reduced to the cross-modal similarity in eq. (1): indeed, assuming that the node features encode the textual relevance $s_t(q, d_i)$ and that $\mathcal{N}_i := N$ is the whole set of candidates to re-rank, then: in other words, one layer defined with eq. (3) includes the standard cross-modal relevance feedback as a special case. equation (3) is more general, and can easily be used as a building block in a differentiable ranking architecture.
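since eq. (1) itself did not survive extraction, here is a small sketch of the cross-modal similarity as we read it: the score of a document is the sum of its visual similarities to the top-k text-ranked documents, weighted by their textual relevance. this reconstruction is ours; the convex re-combination with the initial text score mirrors the mixture mentioned later in the experiments, but its exact form is an assumption.

```python
import numpy as np

def cross_modal_score(text_scores, visual_emb, k=10, alpha=0.5):
    """unsupervised cross-modal PRF, following eq. (1) as we understand it.

    text_scores : (n,) textual relevance s_t(q, d_i) of the candidates,
    visual_emb  : (n, d) L2-normalized image embeddings,
    alpha       : assumed mixture weight with the initial text score."""
    top = np.argsort(-text_scores)[:k]            # NN_t^k(q): top-k candidates by text
    sim = visual_emb @ visual_emb[top].T          # s_v(d_i, d_j), cosine if normalized
    prf = sim @ text_scores[top]                  # sum_j s_t(q, d_j) * s_v(d_i, d_j)
    return alpha * text_scores + (1 - alpha) * prf

# toy re-ranking of 200 candidates
rng = np.random.default_rng(0)
v = rng.normal(size=(200, 64))
v /= np.linalg.norm(v, axis=1, keepdims=True)
s_t = rng.random(200)
print(np.argsort(-cross_modal_score(s_t, v))[:5])
```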
in the following, we derive a simple convolution layer from eq. (3), and we introduce the complete architecture -called dcmm for differentiable cross-modal model-, summarized in fig. 1. learning to rank features $x_{q,d_i}$ are first encoded with an mlp($\cdot$; θ) with relu activations, in order to obtain node features $h_i^{(0)}$. then, the network splits into two branches: -the first branch simply projects each $h_i^{(0)}$ linearly, which acts as a pure text-based score. -the second branch is built upon one or several layer(s) of cross-modal convolution, simply defined as: for the edge function g, we consider two cases: the cosine similarity $g_{cos}$, defining the first model (referred to as dcmm-cos), and a simple learned similarity measure parametrized by a vector a, defining the second model (referred to as dcmm-edge). after the convolution(s), the final embedding of each node $h_i^{(L)}$ is projected to a real-valued score $s_{conv}(q, d_i)$, using either a linear layer or a small mlp with parameters ω. finally, the two scores are combined to obtain the final ranking score: the model is trained using backpropagation and any standard learning to rank loss: pointwise, pairwise or listwise. it is worth remarking that, by extending prf mechanisms to cross-modal re-ranking, our model is actually closer to the listwise context-based models introduced in sect. 2 than to current state-of-the-art cross-modal retrieval models. it is listwise by design: an example in a batch is not a single image in isolation, but all the candidate images for a given query, encoded as a graph, that we aim to re-rank together in a one-shot manner. in our experiments, we used the pairwise bpr loss [46], from which we obtained the best results. let us consider a graph (i.e. the set of candidate documents for query q) in the batch, and all the feasible pairs of documents $d_q^{+,-}$ for this query (by feasible, we mean all the pairs that can be made from positive and negative examples in the graph). then the loss is defined: note that contrary to previous works on listwise context modeling, we consider a set of objects to re-rank, and not a sequence (for instance in [4], an rnn encoder is learned for re-ranking). in other words, we discard the rank information of the first ranker in the re-ranking process: we claim that the role of the first retriever is to be recall-oriented, and not precision-oriented. thus, using the initial order might be too strong a prior, and add noise. moreover, in the case of implicit feedback (clicks used as weak relevance signals), using rank information raises the issue of biased learning to rank (sensitivity to position and trust biases). it is also worth emphasizing that, contrary to most of the works around graph convolution models, our graph structure is somehow implicit: while edges between nodes generally indicate a certain relationship between nodes (for instance, a connection between two users in a social network), in our case a connection represents the visual similarity between two nodes. in the following, we introduce the two datasets we used to validate our approach -a public dataset from a mediaeval challenge, and an annotated set of queries sampled from image search logs of naver, the biggest commercial search engine in korea-, as well as our experimental strategy. we emphasize that we restrict ourselves to two relatively small datasets and few features as input for the models. even though the formulation from eq. (3) is very general, our claim is that a simple model, i.e. one containing a few hundred to a few thousand parameters, should be able to reproduce the prf mechanisms introduced in sect. 3.
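a compact pytorch sketch of the two pieces just described: one cross-modal convolution layer with the cosine edge function (dcmm-cos) and the bpr loss over feasible pairs of one query graph. layer sizes, the order of aggregation and transformation, and the summation of the two branch scores are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalConv(nn.Module):
    """one cross-modal convolution: each node aggregates its neighbors' features,
    weighted by the visual similarity of the corresponding images (g_cos edge function)."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, h, v, adj):
        # h: (n, dim) node features, v: (n, d) L2-normalized image embeddings,
        # adj: (n, n) {0,1} mask of the visual kNN graph
        edges = (v @ v.t()) * adj                 # cosine similarities on the kept edges
        return torch.relu(self.lin(edges @ h))    # aggregate then transform (assumed order)

class DCMMCos(nn.Module):
    def __init__(self, n_features, dim=32, n_layers=1):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_features, dim), nn.ReLU())
        self.text_score = nn.Linear(dim, 1)       # first branch: pure text-based score
        self.convs = nn.ModuleList(CrossModalConv(dim) for _ in range(n_layers))
        self.conv_score = nn.Linear(dim, 1)       # s_conv from the convolved embeddings

    def forward(self, x, v, adj):
        h0 = self.encode(x)
        h = h0
        for conv in self.convs:
            h = conv(h, v, adj)
        # assumed combination of the two branches: a simple sum of the two scores
        return (self.text_score(h0) + self.conv_score(h)).squeeze(-1)

def bpr_loss(scores, labels):
    """pairwise BPR over all feasible (relevant, non-relevant) pairs of one query graph."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return -F.logsigmoid(pos.unsqueeze(1) - neg.unsqueeze(0)).mean()

# toy query graph: 20 candidates, 4 LTR features, 64-d visual embeddings, 5-NN visual graph
x = torch.randn(20, 4)
v = F.normalize(torch.randn(20, 64), dim=1)
adj = torch.zeros(20, 20).scatter_(1, (v @ v.t()).topk(5, dim=1).indices, 1.0)
labels = torch.tensor([1] * 5 + [0] * 15)
model = DCMMCos(n_features=4)
loss = bpr_loss(model(x, v, adj), labels)
loss.backward()
print(float(loss))
```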
when adapting the approach to larger datasets, the model capacity can be adjusted accordingly, in order to capture more complex relevance patterns. note that we did not consider in our study standard datasets generally used to train joint embeddings such as mscoco [34] or flickr30k [43] , because the retrieval scenario is rather artificial, compared to web search: there are no explicit queries, and a text is only relevant to a single image. furthermore, we have tried to obtain the clickture [24] dataset without success 6 , and therefore cannot report on it. mediaeval. we first conduct experiments on the dataset from the "mediae-val17, retrieving diverse social images task" challenge 7 . while this challenge also had a focus on diversity aspects, we solely consider the standard relevance ranking task. the dataset is composed of a ranked list of images (up to 300) for each query, retrieved from flickr using its default ranking algorithm. the queries are general-purpose queries (e.g. q = autumn color ), and each image has been annotated by expert annotators (binary label, i.e. relevant or not). the goal is to refine the results from the base ranking. the training set contains 110 queries for 33340 images, while the test set contains 84 queries for 24986 images. while we could consider any number of learning to rank features as input for our model, we choose to restrict ourselves to a very narrow set of weak relevance signals, in order to remain comparable to its unsupervised counterpart, and ensure that the gain does not come from the addition of richer features. hence, we solely rely on four relevance scores, namely tf-idf, bm25, dirichlet smoothed lm [60] and desm score [39] , between the query and each image's text component (the concatenation of the image title and tags). we use an inception-resnet model [48] pre-trained on imagenet to get the image embeddings (d = 1536). webq. in order to validate our approach on a real world dataset, we sample a set of 1000 queries 8 from the image search logs of naver. all images appearing in the top-50 candidates for these queries within a period of time of two weeks have been labeled by three annotators in terms of relevance to the query (binary label). because of different query characteristics (in terms of frequency, difficulty etc.), and given the fact that new images are continuously added to/removed from the index, the number of images per query in our sample is variable (from around ten to few hundreds). note that, while we actually have access to a much larger amount of click logs, we choose to restrict the experiments to this small sample in order keep the evaluations simple. our goal here is to show that we are able to learn and reproduce some prf mechanisms, without relying on large amount of data. moreover, in this setting, it is easier to understand model's behaviour, as we avoid to deal with click noise and position bias. after removing queries without relevant images (according to majority voting among the three annotators), our sample includes 952 queries, and 43064 images, indexed through various text fields (title of the page, image caption etc.). we select seven of such fields, that might contain relevant pieces of information, and for which we compute two simple relevance features w.r.t. query q: bm25 and desm [39] (using embeddings trained on a large query corpus from an anterior period). we also add an additional feature, which is a mixture of the two above, on the concatenation of all the fields. 
image embeddings (d = 2048) are obtained using a resnet-152 model [22] pre-trained on imagenet. given the limited number of queries in both collections, we conducted 5-fold cross-validation, by randomly splitting the queries into five folds. the model is trained on 4 folds (with 1 fold kept for validation, as we use early stopping on ndcg), and evaluated on the remaining one; this procedure is repeated 5 times. then, the average validation ndcg is used to select the best model configuration. note that for the mediaeval dataset, we have access to a separate test set, so we modify slightly the evaluation methodology: we do the above 5fold cross-validation on the training set, without using a validation fold (hence, we do not use early stopping, and the number of epochs is a hyperparameter to tune). once the best model has been selected with the above strategy, we re-train it on the full training set, and give the final performance on the test set. we report the ndcg, map, p@20, and ndcg@20 for both datasets. we train the models using stochastic gradient descent with the adam optimizer [29] . we set the batch size (i.e. number of graphs per batch) to mediaeval, we also tune the number of epochs ∈ {50, 100, 200, 300, 500}, while for webq, we set it to 500, and use early stopping with patience set to 80. all node features are query-level normalized (mean-std normalization). the models are implemented using pytorch and pytorch geometric 9 [14] for the message passing components. in order to be fair, we want to compare methods with somewhat similar feature sets. obviously, for the supervised methods, results can be improved by either adding richer/more features, or increasing models' capacity. for both datasets, we compare our dcmm model to the following baselines: -a learning to rank model only based on textual features (ltr). -the cross-modal similarity introduced in sect. 3.1 [2, 3, 10, 11, 45] (cm). -the above ltr model with the cross-modal similarity as additional input feature (ltr+cm), to verify that it is actually beneficial to learn the crossmodal propagation in dcmm in a end-to-end manner. for the cross-modal similarity, we use as proxy for s t (q, .) a simple mixture of term-based relevance score (dirichlet-smoothed lm and bm25 for respectively mediaeval and webq) and desm score, on a concatenation of all text fields. from our experiments, we observe that it is actually beneficial to recombine the cross-modal similarity with the initial relevance s t (q, .), using a simple mixture. hence, three parameters are tuned (the two mixture parameters, and the number of neighbors for the query), following the evaluation methodology introduced in sect. 4.2 10 . the ltr models are standard mlps: they correspond to the upper part of architecture fig. 1 (text branch), and are tuned following the same strategy. we do not compare our models with joint embedding approaches on those datasets for the reasons mentioned in sect. 2, but also due to our initial experiments on medieval which gave poor results. for the sake of illustration, on mediaeval, 64% of the queries have no lemmas in common with training queries (and 35% for webq): given the relatively small size of these datasets, the models cannot generalize to unseen queries. this illustrates an "extreme" example of the generalization issues -especially on tail queries-of joint embedding techniques. 
in the meantime, as our model is fed with learning to rank features, especially term-based relevance scores like bm25, it could be less sensitive to generalization issues, for instance on new named entities. however, we want to emphasize that both approaches are not antagonist, but can actually be complementary. as our model can be seen as an extension of listwise learning to rank for bi-modal objects (if edges are removed, the model reduces to a standard mlp-based learning to rank), it can take as input node features matching scores from joint embeddings models. the model being an extension of prf, we actually see the approaches at different stages of ranking. table 1 gathers the main results of our study. without too much surprise, going from pure text ranker to a model using both media types improves the results by a large margin (all the models are significantly better than the text-based ltr model, so we do not include these tests on table 1 for clarity). moreover, results indicate that combining initial features with the unsupervised cross-modal similarity in a ltr model allows to slightly improve results over the latter (not significantly though) for the mediaeval dataset, while it has no effect on webq: this is likely due to the fact that features are somehow redundant in our setting, because of how s t (q, .) is computed for the cross-modal similarity; the same would not hold if we would consider a richer set of features for the ltr models. furthermore, the dcmm-cos model outperforms all the baselines, with larger margins for mediaeval than for webq; the only significant result (p-value < 0.05) is obtained for the map on mediaeval. nevertheless, it shows that this simple architecture -the most straightforward extension of cross-modal similarity introduced in sect. 3.1-, with a handful of parameters (see table 1 ) and trained on small datasets, is able to reproduce prf mechanisms. interestingly, results tend to drop as we increase the number of layers (best results are obtained with a single convolution layer), no matter the number of neighbors chosen to define the visual graph. while it might be related to the relative simplicity of the model, it actually echoes common observations in prf models (e.g. [3] ): if we propagate too much, we also tend to diffuse information too much. similarly, we can also make a parallel with over-smoothing in gnns [33] , which might be more critical for prf, especially considering the simplicity of this model. the dcmm-edge shows interesting results: on webq, we manage to improve results significantly w.r.t. to cm sim, while on mediaeval, results are slightly worse than dcmm-cos (except for the map). it might be due to the fact that images in the latter are more alike to the ones used to train image signatures, compared to the (noisy) web images in webq; hence, learning a new metric between images has less impact. interestingly, for both datasets, best results are obtained with more than a single layer; we hypothesize that the edge function plays the role of a simple filter for edges, allowing to propagate information from useful nodes across more layers. note that the number of layers needed for the task is tied with how we define our input graph: the less neighbors we consider for each node, the more layers might be needed, in order for each node to gather information from useful nodes. in fig. 2 , we observe that if the number of neighbors is too small (e.g. 3 or 5), then the model needs more layers to improve performance. 
on the other side, when considering too many neighbors (e.g. 20 or all), the nodes already have access to all the useful neighbors, hence adding layers only reduces performances. we need to find the right balance between the number of neighbors and the number of convolution layers, so that the model can learn to propagate relevant signals (e.g. 10 neighbors and 3 layers for webq). in this paper, we have proposed a reformulation of unsupervised cross-modal prf mechanisms for image search as a differentiable architecture relying on graph convolutions. compared to its unsupervised counterpart, our novel approach can integrate any set of features, while providing a high flexibility in the design of the architecture. experiments on two datasets showed that a simple model derived from our formulation achieved comparable -or better-performance compared to cross-modal prf. there are many extensions and possible directions stemming from the relatively simple model we have studied. given enough training data (e.g. large amount of click logs), we could for instance learn to dynamically filter the visual similarity by using an attention mechanism to choose which nodes to attend, similarly to graph attention networks [50] and transformer model [49] , discarding the need to set the number of neighbors in the input graph. finally, our approach directly addressed the cross-modal retrieval task, but its application to the more general prf problem in ir remains possible. learning deep structured semantic models for web search using clickthrough data xrce's participation to imageclef unsupervised visual and textual information fusion in cbmir using graph-based methods learning a deep listwise context model for ranking refinement deep canonical correlation analysis revisiting approximate metric optimization in the age of deep neural networks from ranknet to lambdarank to lambdamart: an overview crossmodal retrieval in the cooking context: learning semantic text-image embeddings amc: attention guided multimodal correlation learning for image search trans-media pseudo-relevance feedback methods in multimedia retrieval unsupervised visual and textual information fusion in multimedia retrieval -a graph-based point of view convolutional neural networks for softmatching n-grams in ad-hoc search modeling diverse relevance patterns in ad-hoc retrieval fast graph representation learning with pytorch geometric devise: a deep visual-semantic embedding model neural message passing for quantum chemistry a multi-view embedding space for modeling internet images, tags, and their semantics improving imagesentence embeddings using large weakly annotated photo collections beyond instance-level image retrieval: leveraging captions to learn a global visual representation for semantic a deep relevance matching model for ad-hoc retrieval inductive representation learning on large graphs deep residual learning for image recognition scalable deep multimodal learning for crossmodal retrieval clickage: towards bridging semantic and intent gaps via mining click logs of search engines a position-aware deep model for relevance matching in information retrieval re-pacrr: a context and densityaware neural information retrieval model multi-view deep network for cross-view classification multi-modal image retrieval with random walk on multi-layer graphs adam: a method for stochastic optimization semi-supervised classification with graph convolutional networks relevance based language models nprf: a neural pseudo relevance feedback framework 
for ad-hoc information retrieval deeper insights into graph convolutional networks for semi-supervised learning microsoft coco: common objects in context deep coupled metric learning for crossmodal matching cascade ranking for operational e-commerce search an updated duet model for passage re-ranking learning to match using local and distributed representations of text for web search a dual embedding space model for document ranking task-oriented query reformulation with reinforcement learning deeprank: a new deep architecture for relevance ranking in information retrieval personalized context-aware re-ranking for e-commerce recommender systems flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models cross-modal bidirectional translation via reinforcement learning nle@mediaeval'17: combining cross-media similarity and embeddings for retrieving diverse social images bpr: bayesian personalized ranking from implicit feedback dropout: a simple way to prevent neural networks from overfitting inception-v4, inception-resnet and the impact of residual connections on learning attention is all you need graph attention networks adversarial cross-modal retrieval learning two-branch neural networks for image-text matching tasks learning deep structure-preserving image-text embeddings wsabie: scaling up to large vocabulary image annotation learning semantic structure-preserved embeddings for cross-modal retrieval end-to-end neural ad-hoc ranking with kernel pooling how powerful are graph neural networks? deep correlation for matching images and text graph convolutional neural networks for web-scale recommender systems a study of smoothing methods for language models applied to ad hoc information retrieval deep cross-modal projection learning for image-text matching a domain generalization perspective on listwise context modeling key: cord-020880-m7d4e0eh authors: barrón-cedeño, alberto; elsayed, tamer; nakov, preslav; da san martino, giovanni; hasanain, maram; suwaileh, reem; haouari, fatima title: checkthat! at clef 2020: enabling the automatic identification and verification of claims in social media date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_65 sha: doc_id: 20880 cord_uid: m7d4e0eh we describe the third edition of the checkthat! lab, which is part of the 2020 cross-language evaluation forum (clef). checkthat! proposes four complementary tasks and a related task from previous lab editions, offered in english, arabic, and spanish. task 1 asks to predict which tweets in a twitter stream are worth fact-checking. task 2 asks to determine whether a claim posted in a tweet can be verified using a set of previously fact-checked claims. task 3 asks to retrieve text snippets from a given set of web pages that would be useful for verifying a target tweet’s claim. task 4 asks to predict the veracity of a target tweet’s claim using a set of potentially-relevant web pages. finally, the lab offers a fifth task that asks to predict the check-worthiness of the claims made in english political debates and speeches. checkthat! features a full evaluation framework. the evaluation is carried out using mean average precision or precision at rank k for ranking tasks, and f[formula: see text] for classification tasks. the mission of the checkthat! lab is to foster the development of technology that would enable the automatic verification of claims. 
automated systems for claim identification and verification can be very useful as supportive technology for investigative journalism, as they could provide help and guidance, thus saving time [14, 22, 24, 33] . a system could automatically identify check-worthy claims, make sure they have not been fact-checked already by a reputable fact-checking organization, and then present them to a journalist for further analysis in a ranked list. additionally, the system could identify documents that are potentially useful for humans to perform manual fact-checking of a claim, and it could also estimate a veracity score supported by evidence to increase the journalist's understanding and the trust in the system's decision. checkthat! at clef 2020 is the third edition of the lab. 1 the 2018 edition [29] of checkthat! focused on the identification and verification of claims in political debates. 2 whereas the 2019 edition [9, 10] also focused on political debates, isolated claims were considered as well, in conjunction with a closed set of web documents to retrieve evidence from. 3 in 2020, checkthat! turns its attention to social media-in particular to twitter -as information posted on that platform is not checked by an authoritative entity before publication and such information tends to disseminate very quickly. moreover, social media posts lack context due to their short length and conversational nature; thus, identifying a claim's context is sometimes key for enabling effective fact-checking [7] . the lab is mainly organized around four tasks, which correspond to the four main blocks in the verification pipeline, as illustrated in fig. 1 . tasks 1, 3, and 4 can be seen as reformulations of corresponding tasks in 2019, which enables re-use of training data and systems from previous editions of the lab (cf. sect. 3). task 2 runs for the first time. while tasks 1-4 are focused on twitter, task 5 (not in fig. 1 ) focuses on political debates as in the previous two editions of the lab. all tasks are run in english. additionally, tasks 1, 3, and 4 are also offered in arabic and/or spanish. task 1 is formulated as follows: given a topic and a stream of potentially-related tweets, rank the tweets according to their check-worthiness for the topic. previous work on check-worthiness focused primarily on political debates and speeches, but here we focus on tweets instead. we include "topics" this year, as we want to have a scenario that is close to that from 2019; a topic gives a context just like a debate did. we construct the dataset by tracking a set of manually-created topics in twitter. a sample of tweets from the tracked stream (per topic) is shared with the participating systems as input for task 1. the systems are asked to submit a ranked list of the tweets for each topic. finally, using pooling, a set of tweets is selected and then judged by in-house annotators. evaluation. we treat task 1 as a ranking problem. systems are evaluated using ranking evaluation measures, namely mean average precision (map) and precision at rank k (p@k). the official measure is p@30. given an input claim c and a set v c = {v i } of verified claims, we consider each pair (c, v i ) as relevant if v i would save the process of verifying c from scratch, and as irrelevant otherwise. note that there might be more than one relevant verified claim per input claim, e.g., because the input claim might be composed of multiple claims. 
the task is similar to paraphrasing and textual similarity tasks, as well as to textual entailment [8, 12, 30] . dataset. verified claims are retrieved from fact-checking websites such as snopes and politifact. evaluation. mean average precision on the first 5 retrieved claims (map@5) is used to assess the quality of the rankings submitted by the participants. a perfect ranking will have on top all v i such that (c, v i ) is relevant, in any order, followed by all irrelevant claims. in addition to map@5, we also report mrr, map@k (k = 3, 10, 20, all) and recall@k for k = 3, 5, 10, 20 in order to provide participants with more information about their systems. task 3 is defined as follows: given a check-worthy claim on a specific topic and a set of text snippets extracted from potentially-relevant webpages, return a ranked list of all evidence snippets for the claim. evidence snippets are those snippets that are useful in verifying the given claim. dataset. while tracking on-topic tweets, we search the web to retrieve top-m web pages using topic-related queries. this would ensure the freshness of the retrieved pages and enable reusability of the dataset for real-time verification tasks. once we acquire annotations for task 1, we share with participants the web pages and text snippets from them solely for the check-worthy claims, which would enable the start of the evaluation cycle for task 3. in-house annotators will label each snippet as evidence or not for a target claim. evaluation. tasks 3 is a ranking problem. we evaluate the ranked list per topic using map and p@k. the official measure is p@10. task 4 is defined as follows: given a check-worthy claim on a specific topic and a set of potentially-relevant web pages, predict the veracity of the claim. this task closes the verification pipeline. dataset. the dataset for this task is the same as for task 3. the only difference is that the in-house annotators judge each claim as true or false. evaluation. task 4 is a binary classification problem. therefore, it is evaluated using standard classification evaluation measures: precision, recall, f 1 , and accuracy. the official measure is macro-averaged f 1 . task 5 is defined as follows: given a debate segmented into sentences, together with speaker information, prioritize sentences for fact-checking. this is a ranking task and each sentence should be associated with a score. dataset. this is the third iteration of this task. we believe it is important to keep it alive as we have a large body of annotated data already and new material arrives with the coming 2020 us presidential elections. evaluation. task 5 is yet another ranking problem. we use map as the official evaluation measure. we further report p@k for k ∈ {5, 10, 20, 50}. two editions of checkthat! have been held so far. while the datasets come from different genres, some of the tasks in the 2020 edition are reformulated. hence, considering some of the most successful approaches applied in the past represents a good starting point to address the current challenges. the 2019 edition featured two tasks [10] : task 1 2019 . given a political debate, interview, or speech, transcribed and segmented into sentences, rank the sentences by the priority with which they should be fact-checked. the most successful approaches used neural networks for the individual classification of the instances. for example, hansen et al. [19] learned domain-specific word embeddings and syntactic dependencies and applied an lstm classifier. 
using some external knowledge paid off-they pre-trained the network with previous trump and clinton debates, supervised weakly with the claimbuster system. some efforts were carried out in order to consider context. favano et al. [11] trained a feed-forward neural network, including the two previous sentences as context. whereas many approaches opted for embedding representations, feature engineering was also popular [13] . the systems for evidence passage identification followed two approaches. bert was trained and used to predict whether an input passage is useful to fact-check a claim [11] . other participating systems used classifiers (e.g., svm) with a variety of features including similarity between the claim and a passage, bag of words, and named entities [20] . as for predicting claim veracity, the most effective approach used a textual entailment model. the input was represented using word embeddings and external data was also used in training [15] . in the 2020 edition, task 1 2019 becomes task 5, and task 1 is a reformulation based on tweets (cf. sect. 2.1). see [2] for further details. task 2 2019 becomes tasks 3 and 4 (cf. sects. 2.3 and 2.4). see [21] for further details. the 2018 edition featured two tasks [29] : task 1 2018 was identical to task 1 2019 . the most successful approaches used either a multilayer perceptron or an svm. zuo et al. [36] enriched the dataset by producing pseudo-speeches as a concatenation of all interventions by a debater. they used averaged word embeddings and bag-of-words as representations. hansen et al. [18] represented the entries with embeddings, part of speech tags, and syntactic dependencies. they used a gru neural network with attention. see [1] for further details. task 2 2018 . given a check-worthy claim in the form of a (transcribed) sentence, determine whether the claim is likely to be true, half-true, or false. the best way to address this task was to retrieve relevant information from the web, followed by a comparison to the claim in order to assess its factuality. 4 after retrieving such evidence, it is fed into the supervised model, together with the claim in order to assess its veracity. in the case of [18] , they fed the claim and the most similar web-retrieved text to convolutional neural networks and svms. meanwhile, ghanem et al. [16] computed features, such as the similarity between the claim and the web text, and the alexa rank for the website. see [4] for further details. there has been work on checking the factuality/credibility of a claim, of a news article, or of an information source [3, 25, 26, 28, 31, 35] . claims can come from different sources, but special attention has been given to those from social media [17, 27, 32, 34] . check worthiness estimation is still a fairly-new problem especially in the context of social media [14, [22] [23] [24] . checkthat! further shares some aspects with other initiatives that have been run with high success in the past, e.g., stance detection (fake news 5 ), semantic textual similarity (sts at semeval 6 ), and community question answering (cqa at semeval 7 ). we have presented the 2020 edition of the checkthat! lab, which features tasks that span the full verification pipeline: from spotting check-worthy claims to checking whether they have been fact-checked elsewhere already, to retrieving useful passages within relevant pages, to finally making a prediction about the factuality of a claim. 
to the best of our knowledge, this is the first shared task that addresses all steps of the fact-checking process. moreover, unlike previous editions of the checkthat! lab, our main focus here is on social media, which are the center of "fake news" and disinformation. we further feature a more realistic information retrieval scenario with pooling for evaluation, as done at ir venues such as trec. last but not least, in-line with the general mission of clef, we promote multi-linguality by offering our tasks in different languages. we hope that these tasks and the associated datasets will serve the mission of the checkthat! initiative, which is to foster the development of datasets, tools and technology that would enable the automatic verification of claims and will support human fact-checkers in their fight against "fake news" and disinformation. overview of the clef-2018 checkthat! lab on automatic identification and verification of political claims. task 1: check-worthiness overview of the clef-2019 checkthat! lab on automatic identification and verification of claims. task 1: check-worthiness vera: a platform for veracity estimation over web data overview of the clef-2018 checkthat! lab on automatic identification and verification of political claims. task 2: factuality working notes of clef 2019 conference and labs of the evaluation forum. ceur workshop proceedings. ceur-ws.org working notes of clef 2018-conference and labs of the evaluation forum. ceur workshop proceedings. ceur-ws.org a content management perspective on fact-checking semeval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation checkthat! at clef 2019: automatic identification and verification of claims overview of the clef-2019 checkthat! lab: automatic identification and verification of claims theearthisflat's submission to clef'19 structural representations for learning relations between pairs of texts the ipipan team participation in the check-worthiness task of the clef2019 checkthat! lab a contextaware approach for detecting worth-checking claims in political debates upv-uma at checkthat! lab: verifying arabic claims using cross lingual approach upv-inaoe-autoritas -check that: preliminary approach for checking worthiness of claims tweetcred: real-time credibility assessment of content on twitter the copenhagen team participation in the check-worthiness task of the competition of automatic identification and verification of claims in political debates of the clef-2018 fact checking lab neural weakly supervised fact check-worthiness detection with contrastive sampling-based ranking loss bigir at clef 2019: automatic verification of arabic claims over the web overview of the clef-2019 checkthat! lab on automatic identification and verification of claims. 
task 2: evidence and factuality detecting check-worthy factual claims in presidential debates comparing automated factual claim detection against judgments of journalism organizations claimbuster: the first-ever end-to-end fact-checking system fully automated fact checking using external sources detecting rumors from microblogs with recurrent neural networks credbank: a large-scale social media corpus with associated credibility annotations leveraging joint interactions for credibility analysis in news communities overview of the clef-2018 lab on automatic identification and verification of claims in political debates semeval-2016 task 3: community question answering credibility assessment of textual claims on the web fake news detection on social media: a data mining perspective it takes nine to smell a rat: neural multi-task learning for check-worthiness prediction enquiring minds: early detection of rumors in social media from enquiry posts analysing how people orient to and spread rumours in social media by looking at conversational threads a hybrid recognition system for check-worthy claims using heuristics and supervised learning acknowledgments. the work of tamer elsayed and maram hasanain was made possible by nprp grant# nprp 11s-1204-170060 from the qatar national research fund (a member of qatar foundation). the work of reem suwaileh was supported by gsra grant# gsra5-1-0527-18082 from the qatar national research fund and the work of fatima haouari was supported by gsra grant# gsra6-1-0611-19074 from the qatar national research fund. the statements made herein are solely the responsibility of the authors. this research is also part of the tanbih project, developed by the qatar computing research institute, hbku and mit-csail, which aims to limit the effect of "fake news", propaganda, and media bias. key: cord-020916-ds0cf78u authors: fard, mazar moradi; thonet, thibaut; gaussier, eric title: seed-guided deep document clustering date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_1 sha: doc_id: 20916 cord_uid: ds0cf78u different users may be interested in different clustering views underlying a given collection (e.g., topic and writing style in documents). enabling them to provide constraints reflecting their needs can then help obtain tailored clustering results. for document clustering, constraints can be provided in the form of seed words, each cluster being characterized by a small set of words. this seed-guided constrained document clustering problem was recently addressed through topic modeling approaches. in this paper, we jointly learn deep representations and bias the clustering results through the seed words, leading to a seed-guided deep document clustering approach. its effectiveness is demonstrated on five public datasets. clustering traditionally consists in partitioning data into subsets of similar instances with no prior knowledge on the clusters to be obtained. however, clustering is an ill-defined problem in the sense that the data partitions output by clustering algorithms have no guarantee to satisfy end users' needs. indeed, different users may be interested in different views underlying the data [25] . for example, considering either the topics or the writing style in a collection of documents leads to different clustering results. in this study, we consider a setting where clustering is guided through user-defined constraints, which is known as constrained clustering [2] . 
enabling users to provide clustering constraints in the context of an exploratory task can help obtain results better tailored to their needs. typically, must-link and cannot-link constraints are considered (e.g., see [27, 29] ), which state whether two data instances should be (respectively, should not be) in the same cluster. however, important manual annotation efforts may still be required to provide such constraints in sufficient number. in the specific case of document clustering, constraints can otherwise be provided in the form of seed words: each cluster that the user wishes to obtain is described by a small set of words (e.g., 3 words) which characterize the cluster. for example, a user who wants to explore a collection of news articles might provide the set of seed words {'sport', 'competition', 'champion'}, {'finance', 'market', 'stock'}, {'technology', 'innovation', 'science'} to guide the discovery of three clusters on sport, finance, and technology, respectively. recent studies which include seed word constraints for document clustering are mostly focused on topic modeling approaches [8, 16, 17, 19] , inspired by the latent dirichlet allocation model [3] . concurrently, important advances on clustering were recently enabled through its combination with deep representation learning (e.g., see [12, 23, 30, 31] ), which is now known as deep clustering. a common approach to deep clustering is to jointly train an autoencoder and perform clustering on the learned representations [23, 30, 31] . one advantage of deep clustering approaches lies in their ability to leverage semantic representations based on word embeddings, enabling related documents to be close in the embedding space even when they use different (but related) words. the main contributions of this study can be summarized as follows: (a) we introduce the seed-guided deep document clustering (sd2c) framework, 1 the first attempt, to the best of our knowledge, to constrain clustering with seed words based on a deep clustering approach; and (b) we validate this framework through experiments based on automatically selected seed words on five publicly available text datasets with various sizes and characteristics. the remainder of the paper is organized as follows. in sect. 2, we describe existing works on seed-guided constrained document clustering, also known as dataless text classification. section 3 then introduces the seed-guided deep document clustering framework, which is then evaluated in sect. 4. section 5 concludes the paper and provides some perspectives on sd2c. one way to address the above problem is to try and identify multiple clustering views from the data in a purely unsupervised fashion [7, 24, 25] . while such an approach provides users with several possible clustering results to choose from, there is still no guarantee that the obtained clusters are those the users are interested in. the constrained clustering problem we are addressing in fact bears strong similarity with the one of seed-guided dataless text classification, which consist in categorizing documents based on a small set of seed words describing the classes/clusters. for a more general survey on constrained clustering, we invite the reader to refer to [2] . the task of dataless text classification was introduced independently by liu et al. [20] and ko et al. [14] . in [20] , the seed words are provided by a user and exploited to automatically label a part of the unlabeled documents. 
on the other hand, in [14] , seed words initially correspond to labels/titles for the classes of interest and are extended based on co-occurrence patterns. in both cases, a naive bayes classifier is applied to estimate the documents' class assignments. in the wake of these seminal works, several studies further investigated the exploitation of seed words for text classification [6, 9, 10] . chang et al. [6] introduced both an 'on-the-fly' approach and a bootstrapping approach by projecting seed words and documents in the same space. the former approach simply consists in assigning each document to the nearest class in the space, whereas the latter learns a bootstrapping naive bayes classifier with the class-informed seed words as initial training set. another bootstrapping approach is studied in [10] , where two different methods are considered to build the initial training set from the seed words: latent semantic indexing and gaussian mixture models. the maximum entropy classifier proposed in [9] instead directly uses seed words' class information by assuming that documents containing seed words from a class are more likely to belong to this class. more recently, the dataless text classification problem was addressed through topic modeling approaches [8, 16, 17, 19] , extending the latent dirichlet allocation model [3] . the topic model devised by chen et al. [8] integrates the seed words as pseudo-documents, where each pseudo-document contains all the seed words given for a single class. the co-occurrence mechanism underlying topic models along with the known class membership of pseudo-documents help guide the actual documents to be classified towards their correct class. in [17] , the seed-guided topic model (stm) distinguishes between two types of topics: category topics and general topics. the former describe the class information and are associated with an informed prior based on the seed words, whereas the latter correspond to the general topics underlying the whole collection. the category topics assigned to a document are then used to estimate its class assignment. stm was extended in [16] to simultaneously perform classification and document filtering -which consists in identifying the documents related to a given set of categories while discarding irrelevant documents -by further dividing category topics into relevant and non-relevant topics. similarly to stm, the laplacian seed word topic model (lapswtm) introduced by li et al. [19] considers both category topics and general topics. it however differs from previous models in that it enforces a document manifold regularization to overcome the issue of documents containing no seed words. if these models outperform previously proposed models, they suffer from a lack of flexibility on the input representations they rely on. indeed, topic models require documents to be organized as sets of discrete units -the word tokens. this prohibits the use of representation learning techniques such as word embeddings (e.g., word2vec [22] and glove [26] ). to the best of our knowledge, only one deep learning-based approach was proposed to address a problem similar to dataless text classification [18] . in this recent work, li et al. devised a deep relevance model for zero-shot document filtering -which consists at test time in predicting the relevance of documents with respect to a category unseen in the training set, where each category is characterized by a set of seed words. 
this problem is nonetheless different from dataless text classification as it focuses on estimating documents' relevance (or lack thereof) instead of class membership. deep clustering consists in jointly performing clustering and deep representation learning in an unsupervised fashion (e.g., with an auto-encoder). all deep clustering approaches aim at obtaining representations that are both faithful to the original documents and are more suited to document clustering purposes than the original document representation. to do so, they trade off between a reconstruction loss, denoted l_rec, and a clustering loss, denoted l_clust, through a joint optimization problem of the form l_rec + λ_0 · l_clust, where λ_0 is a hyperparameter balancing the contribution of the reconstruction and clustering losses. in the remainder, x will denote the set of documents to cluster. each document x ∈ x is associated with a representation x in r^d (hereafter, the input space) defined as the average of the (precomputed) embeddings of the words in x, where d is the dimension of the word embedding space. each word w is thus represented as a d-dimensional vector w corresponding to its embedding (sect. 4 further discusses the different word embeddings considered). let f_θ : r^d → r^p and g_η : r^p → r^d be an encoder and a decoder with parameters θ and η, respectively; g_η ∘ f_θ then defines an auto-encoder (ae). r^p denotes the space in which we wish to embed the learned document representations (hereafter, the embedding space). lastly, we denote by r the parameters of the clustering algorithm. with a slight abuse of notation in which f_θ(x) corresponds to the application of the function f_θ to each element of the set x, the overall deep clustering (dc) optimization problem takes the form min_{θ,η,r} l_rec(x, g_η(f_θ(x))) + λ_0 · l_clust(f_θ(x); r) (1). we propose to integrate constraints on seed words in this framework by biasing the embedding representations, which guarantees that the information pertaining to seed words will be used in the clustering process. this can be done by enforcing that seed words have more influence either on the learned document embeddings, a solution we refer to as sd2c-doc, or on the cluster representatives, a solution we refer to as sd2c-rep. note that the second solution can only be used when the clustering process is based on cluster representatives (i.e., r = {r_k}_{k=1}^k with k the number of clusters), which is indeed the case for most current deep clustering methods [1] . in addition to the notations introduced previously, we will denote by s_k the subset of seed words corresponding to cluster k, and by s = {s_k}_{k=1}^k the complete set of seed words defining the prior knowledge on the k clusters to recover. we further define the union ∪_{k=1}^k s_k as the set of seed words from all clusters. one way to bias the document representations according to the seed words is to reduce the gap in the embedding space between the representation of the documents and the representation of the seed words occurring in these documents. for that purpose, we first define, for each document, a masked version of it that is based on seed words. this can be done aggressively, by retaining, in the masked version, only the words that correspond to seed words and by computing an average of their word embeddings, or smoothly by reweighing all words in the original document according to their proximity with seed words. a weighted average of their embeddings then defines the smooth, masked version of the documents. 
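before turning to the formal definition of these masked representations, the short sketch below illustrates the input representation x used throughout: the average of precomputed word embeddings of the words in a document. the toy vocabulary and embedding matrix are invented; in the paper the embeddings come from a word2vec model trained on each collection (cf. sect. 4).

```python
import numpy as np

# toy precomputed word embeddings (in the paper: word2vec vectors, d = 100)
d = 4
vocab = {"sport": 0, "market": 1, "stock": 2, "science": 3}
embeddings = np.random.RandomState(0).randn(len(vocab), d)

def doc_vector(tokens, vocab, embeddings):
    """average of the embeddings of the in-vocabulary tokens of a document."""
    ids = [vocab[t] for t in tokens if t in vocab]
    if not ids:                        # document with no known word
        return np.zeros(embeddings.shape[1])
    return embeddings[ids].mean(axis=0)

docs = [["market", "stock", "crash"], ["sport", "science"]]
X = np.stack([doc_vector(doc, vocab, embeddings) for doc in docs])
print(X.shape)   # (2, 4): one d-dimensional input vector per document
```

these vectors are what the encoder f_θ receives in problem (1).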
the equation below formalizes these two approaches: where cos denotes the cosine similarity. if a document x does not contain any seed word, m s (x) is ill-defined when using the first version of eq. 2 as w∈s is null in that case. to address this issue, one can simply discard the documents without seed words. in practice, the two masked versions of eq. 2 yielded the same results in our experiments. because of its simplicity, we rely on the first one in the remainder of the paper, which is illustrated in fig. 1 (left). one can then force the embedding representation of documents to be close to the embedding of their masked version by minimizing the dissimilarity in the embedding space, denoted by δ e , between f θ (x) and f θ •m s (x), leading to: where λ 1 is an hyperparameter controlling the importance of the deep clustering loss l dc and the loss associated to seed words. the other bias one can consider in the embedding space is the one related to cluster representatives. here, one can naturally push cluster representatives towards the representation of seed words, in order to ensure that the discovered clusters will account for the prior knowledge provided by them. for that purpose, we first build a representation for each subset of seed words by averaging the word embeddings of the seed words it contains: s k thus corresponds to the seed word-based representation of cluster k in r d . the optimization problem solved by sd2c-rep, depicted in fig. 1 (right) , then takes the form: as before, δ e denotes a dissimilarity in the embedding space. the last term in eq. 4 forces cluster representatives to be close to subsets of seed words, the alignment between the two being defined by the initialization of the cluster representatives performed after pretraining (see sect. 3.1 below). in practice, we use fully differentiable formulations of problems 3 and 4. in the context of the k-means algorithm, a popular clustering method, such differentiable formulations can be directly developed on top of the algorithms provided in [31] (called dcn) and [23] (called dkm), the latter proposing a truly joint formulation of the deep clustering problem. other state-of-the-art deep clustering approaches, as idec [12] , also based on cluster representatives, could naturally be adopted as well. the comparison between these approaches performed in [23] nevertheless suggests that dkm outperforms the other approaches. this difference was confirmed on the text collections retained in this study. we thus focus here on the dkm algorithm introduced in [23] with: where δ i denotes a dissimilarity in the input space, and: where α is an inverse temperature parameter and g k (f θ (x), α; r) is a softmax function parameterized by α defined as follows: the k -means solution is recovered when α tends to +∞. following prior deep clustering works [12, 23, 30, 31] , we initialize the autoencoder parameters through pretraining by first only optimizing the reconstruction loss of the auto-encoder. in the pretraining of sd2c-doc, we also include the constraint-enforcing term (second term in problem 3) so that learned representations are impacted by seed words early in the training. at the end of pretraining, the cluster centers are initialized by the seed words cluster embeddings {s k } k k=1 . 2 then, in the fine-tuning phase, the whole loss -including the clustering loss and the constraint-enforcing loss (for sd2c-doc and sd2c-rep) -is optimized. 
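the numpy sketch below spells out three of the ingredients described above: the hard masked representation of a document (average embedding of the seed words it contains), the seed-word-based cluster representation s_k, and the α-parameterized softmax assignment g_k used in the differentiable k-means formulation. the exact dkm and sd2c losses are not reproduced here, the squared euclidean distance used for δ_e below is only one possible choice, and all inputs are toy values.

```python
import numpy as np

def masked_doc(tokens, all_seeds, vocab, emb):
    """hard masked version m_s(x): average embedding of the seed words present in x."""
    ids = [vocab[t] for t in tokens if t in all_seeds]
    return emb[ids].mean(axis=0) if ids else None   # ill-defined if x contains no seed word

def seed_cluster_embeddings(seed_sets, vocab, emb):
    """s_k: average embedding of the seed words describing cluster k."""
    return np.stack([emb[[vocab[w] for w in s]].mean(axis=0) for s in seed_sets])

def soft_assignment(z, reps, alpha=1000.0):
    """g_k(z, alpha; r): softmax over clusters of -alpha times the distance to each r_k."""
    dists = np.array([np.sum((z - r) ** 2) for r in reps])  # squared euclidean, one possible delta_e
    logits = -alpha * dists
    logits -= logits.max()                                  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.RandomState(1)
vocab = {"sport": 0, "champion": 1, "finance": 2, "stock": 3, "science": 4, "market": 5}
emb = rng.randn(len(vocab), 5)
seed_sets = [["sport", "champion"], ["finance", "stock", "market"], ["science"]]
all_seeds = {w for s in seed_sets for w in s}

s_k = seed_cluster_embeddings(seed_sets, vocab, emb)   # also used to initialise the r_k after pretraining
z = masked_doc(["stock", "market", "news"], all_seeds, vocab, emb)  # stand-in for an embedded document
print(np.round(soft_assignment(z, s_k), 3))  # near one-hot assignment over the 3 clusters (k-means recovered as alpha grows)
```

the sd2c-doc and sd2c-rep objectives then add, on top of the dc loss, a dissimilarity between these quantities and the learned document embeddings or cluster representatives, respectively.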
the experiments we performed to evaluate the proposed sd2c framework are based on five publicly available datasets with various sizes and characteristics that have been extensively used in the context of text classification and clustering: the 20 newsgroups 3 dataset, referred to as 20news ; the reuters-21578 4 dataset, referred to as reuters, from which, similarly to [8, 16, 17, 19] , we use only the 10 largest (and highly imbalanced) categories; the yahoo! answers dataset [32] , referred to as yahoo, from which we use only the test set comprising 60,000 documents evenly split into 10 classes; the dbpedia dataset [32] , referred to as dbpedia, from which we also only use the test set made of 70,000 documents uniformly distributed in 14 classes; and the ag news dataset, introduced as well in [32] and referred to as agnews, from which we use the training set, composed of 120,000 documents evenly split into 4 classes. after preprocessing, which includes removing stop words and words made of less than 2 characters, porter stemming and discarding the empty documents, the number of documents in 20news, reuters, yahoo, dbpedia, agnews are respectively 18,846, 7,964, 59,978, 70,000, 120,000. 20news and reuters contain the documents with the greatest and most varied length whereas dbpedia, yahoo, and agnews are made of rather short documents. for both sd2c-doc and sd2c-rep, different dissimilarities can be adopted for δ i and δ e . as the cosine distance performed consistently better for δ e than the euclidean distance in our preliminary experiments, it is adopted here. we nevertheless did not observe such a clear trend for δ i , and we indicate here the results obtained both for the cosine distance and euclidean distance. this yields two versions for each method, which we denote as sd2c-doc-e/sd2c-rep-e and sd2c-doc-c/sd2c-rep-c, depending on whether the euclidean (*-e) or the cosine (*-c) distance is used for δ i , respectively. to compare against sd2c, we considered the following baseline methods: -km, ae-km and dkm : km corresponds to k-means [21] applied on the same input for documents as the one used for sd2c (average of documents' word embeddings); ae-km first trains an auto-encoder on the collection and then applies k-means to the document embeddings learned by the autoencoder; dkm is the deep k-means algorithm 5 presented in [23] which we also study under the two variants dkm-e and dkm-c. 6 -nn : this method is similar to the 'on-the-fly' nearest neighbor-like classification described in [6] . each document, represented by its word embeddings average, is assigned to the nearest class, in terms of the cosine distance, which outperformed the euclidean distance, represented by the class' average seed word embeddings (denoted as {s k } k k=1 in sect. 3). -stm : in our experiments, we ran the java implementation of the seed-guided topic model [17] provided by the authors 7 and used the standard hyperparameters indicated in the paper. given that this approach was not scalable when the whole vocabulary is used, we only kept the 2000 most frequent words (after preprocessing) for each dataset 8 . recent works on dataless text classification [8, 16, 17, 19] only considered the 20news and reuters datasets in their experiments, relying respectively on the seed words induced by the class labels and on the manually curated seed words from [8] . to perform an evaluation on all the collections retained here, we devised a simple heuristics based on tf-idf to propose seed words. 
for a given collection and for each class k of the collection, all words w in the vocabulary are scored according to: where idf(w) is the inverse document frequency computed on the documents of the whole collection and tf k (w) is the term frequency for class k, which we define as the sum of tf x (w) for all documents x in class k. the rationale for this score is that one wishes to select words that are frequent in class k and unfrequent in other classes, hence the penalization term inside the brackets. 5 https://github.com/maziarmf/deep-k-means. 6 seed words are not utilized in these approaches. 7 https://github.com/ly233/seed-guided-topic-model. 8 very recently, another topic modeling approach, the laplacian seed word topic model (lapswtm), was proposed in [19] . however, firstly, lapswtm counts 8 hyperparameters that were empirically optimized in the original paper, and it is not straightforward how these hyperparameters should be tuned on the additional datasets used here. secondly, lapswtm shares a lot with the stm model in its construction and performance. thirdly, the code for lapswtm is, as far as we are aware, not publicly available. for these different reasons, we simply chose stm to represent the state of the art in topic modeling-based dataless text classification. based on this score, one can then select the top words for each class as seed words. we emphasize that such heuristics is only adopted for the purpose of simulating seed words during the evaluation: it is not destined to be used to identify seed words in a real-world application, where ground truth is unknown. the auto-encoder used in our experiments on all datasets is similar to the ones adopted in prior deep clustering works [12, 23, 30, 31] . the encoder and decoder are mirrored fully-connected neural networks with dimensions d-500-500-2000-50 and 50-2000-500-500-d, respectively -d is the input space dimension and 50 corresponds to the dimension p of the auto-encoder embedding space. neural networks' weights are initialized based on the xavier scheme [11] . the sd2c, dkm, and ae-km models are trained with the adam optimizer [13] with standard hyperparameters (η = 0.001, β 1 = 0.9, and β 2 = 0.999) and minibatches of 256 documents. the number of epochs for the auto-encoder pretraining and model finetuning are fixed to 50 and 200, respectively, as in [23] . we also use the inverse temperature α = 1000 from [23] for the parameterized softmax-based differentiable reformulations of sd2c models. the balancing hyperparameters λ 0 and λ 1 of sd2c-doc and sd2c-rep were both set to 10 −5 . we experimented with different word embedding techniques including word2vec [22] , doc2vec [15] , and fasttext [4] trained either on an external large corpus (e.g., google news) or individually on the datasets used in the experiments. we found that training the word embedding models on the experiments' collections consistently improved in terms of clustering performance on external corpus-based training. among the word embedding techniques we tested, word2vec and fasttext performed evenly and significantly better than doc2vec. since word2vec is faster to train than fasttext, which operates at the character level, we chose the former technique (in practice, gensim 9 word2vec python implementation) trained on each of our experiments' datasets to compute the word embeddings. the word embedding size was fixed to 100. the skip-gram model was trained with a window size of 50 words on 20news and 10 words on other datasets. 
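the auto-encoder and optimisation settings listed above translate almost directly into code; the pytorch sketch below is one possible rendering. the activation functions and the exact reconstruction loss are not specified in the text and are assumed here to be relu and mean squared error; the publicly released deep-k-means code referenced above may differ in such details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, p = 100, 50   # input dimension (word2vec size) and embedding dimension

def mlp(sizes):
    """fully-connected stack with relu between layers (the activation choice is an assumption)."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

encoder = mlp([d, 500, 500, 2000, p])     # d-500-500-2000-50
decoder = mlp([p, 2000, 500, 500, d])     # mirrored decoder

for module in list(encoder) + list(decoder):
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)   # xavier initialisation scheme
        nn.init.zeros_(module.bias)

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999))  # eta, beta1, beta2 as above

x = torch.randn(256, d)                   # one mini-batch of 256 document vectors
optimizer.zero_grad()
recon = decoder(encoder(x))
loss = F.mse_loss(recon, x)               # reconstruction loss used during pretraining
loss.backward()
optimizer.step()
```

the remaining settings reported above (50 pretraining and 200 fine-tuning epochs, α = 1000, λ_0 = λ_1 = 10^-5) plug into the training loop and the dkm-style losses, which are omitted here.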
note that a word2vec model is trained once for each dataset so that all approaches rely on the same word embeddings. [table: acc and ari of each model on 20news, reuters, yahoo, dbpedia and agnews.] we measure the clustering performance in terms of clustering accuracy (acc) and adjusted rand index (ari), which are standard clustering metrics [5] . table 1 first provides the macro-average (over the 5 datasets) of these measures for all methods, using the top 3 automatically selected seed words per cluster. as one can note, the use of seed words is beneficial to the clustering. indeed, the approaches which use seed words (nn, stm, sd2c) have markedly higher acc and higher ari than those which do not (km, ae-km, dkm). among these latter methods, dkm is the best one (as a comparison, dcn and idec, mentioned in sect. 3, respectively obtain 64.8 and 64.1 for acc, and 49.3 and 47 for ari). among the methods exploiting seed words, sd2c methods are the best ones, outperforming the baseline nn and the stm method by up to 2.6 points for acc and 3.5 points for ari. we further provide in table 2 a detailed account of the performance of the methods based on seed words. the results have been averaged over 10 runs and are reported with their standard deviation. we furthermore performed an unpaired student t-test with a significance level of 0.01 to study whether differences are significant or not (all results in bold are not statistically different from the best result). as one can note, the proposed sd2c models compare favorably against stm, the strongest baseline. indeed, all sd2c approaches significantly outperform stm on 20news, and sd2c-doc-e/c as well as sd2c-rep-e also significantly outperform stm on yahoo and agnews. on the other hand, stm obtained significantly better results in terms of both acc and ari on reuters and dbpedia, the difference on these collections (and especially on dbpedia) being nevertheless small. among the sd2c methods, sd2c-doc-c yields the best performance overall (as shown in table 1 ). we further compared, in table 3 , the efficiency of stm and the sd2c methods on a machine with eight i7-7700hq cpus at 2.80 ghz, 16 gb ram, and an nvidia geforce gtx 1070 (only used for the deep learning approaches). the runtime of the sd2c approaches is lower than that of stm on most datasets. stm was only faster on agnews (between 2 and 3 times), yet far slower on 20news (about 10 times). this discrepancy can be explained by the fact that stm's complexity is dominated by the total number of tokens (large for a small number of documents in 20news), whereas sd2c models only depend on the number of documents (large with few tokens per document in agnews). as to the sd2c models, sd2c-rep runs faster than sd2c-doc. this is due to the complexity of the constraint-enforcing loss term being lower for the former than for the latter. in our general setting used to report the previous results, the number of seed words per class was arbitrarily set to 3. for comprehensiveness, we study the clustering results of the sd2c models when the number of (automatically selected) seed words per cluster is varied from 1 to 5. the evolution of the performance for the sd2c models in terms of accuracy is illustrated in fig. 2 . we observe that using more seed words leads to notable improvements in most cases, with the exception of sd2c-doc-e, which seems to be less influenced by the number of seed words. this trend is particularly apparent when the number of seed words is increased from 1 to 2. 
although slight performance gain is observed between 2 and 5 seed words, the results exhibit greater stability. this suggests that providing as few as 2 seed words per cluster -which constitutes a modest annotation effort for humans -can prove highly beneficial for the clustering results obtained by our sd2c approaches. in order to check that the method we retained to automatically extract seed words is appropriate, we also computed the results obtained by stm and the sd2c methods using the manual seed words available for 20news and reuters and presented in, e.g., [8, 17, 28] (denoted as s d in the latter). the corresponding list of seed words contain in average 5.1 words per category for 20news and 6.8 words per category for reuters. the procedure to constitute these lists of descriptive seed words is detailed in [8] . table 4 summarizes the results obtained with such seed words. these results first show that the scores obtained by the different methods using the manual seed words are close to the ones obtained with the automatically selected ones. for example, the difference in acc for stm amounts to only 0.7 points on 20news and 0.2 points on reuters. this shows that the automatic selected seed words are a reasonable substitute to manual seed words for evaluation purposes. in addition, sd2c methods still significantly outperform stm on 20news. sd2c-rep-c is here significantly better, even though the difference is not important, than stm on reuters -this is in line with our comment on table 2 on the small differences between stm and sd2c on reuters. we have introduced in this paper the sd2c framework, the first attempt, to the best of our knowledge, to constrain document clustering with seed words using a deep clustering approach. to do so, we have integrated constraints associated to seed words in the deep k-means optimization problem [23] , modifying either the document embeddings, the cluster representatives or the input representations to make them closer to the seed words retained. the new methods thus derived have been evaluated on five text collections widely used for text classification purposes. for this evaluation, we have proposed a simple method to automatically select seed words that behaves comparably to manual seed words for evaluation purposes. several perspectives for this work can be envisaged. first of all, it is possible to extend the current framework with a 'garbage' cluster to collect documents that do not fit well within the clusters defined by the seed words. this can be useful in particular for document filtering [16] . other types of autoencoders and other attention mechanisms can also be designed to try and improve the results of the sd2c methods. combinations of the different approaches can also be studied so as to benefit from their respective strengths. lastly, if the sd2c-docc method overall outperforms the other approaches in terms of accuracy and adjusted rand index, we want to better understand when it is beneficial to bias the document representations and when to bias the cluster representative ones. 
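for reference, the two clustering measures used in the experiments above (acc and ari) can be computed as follows: ari is available in scikit-learn, while clustering accuracy requires matching predicted clusters to ground-truth classes, which is commonly done with the hungarian algorithm. the label vectors below are toy values and this is not the evaluation code used by the authors.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """best accuracy over all one-to-one mappings between clusters and classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                      # contingency counts
    row, col = linear_sum_assignment(-cost)  # maximise the matched counts
    return cost[row, col].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 0]                  # cluster ids are arbitrary labels
print(clustering_accuracy(y_true, y_pred))   # 0.833...
print(adjusted_rand_score(y_true, y_pred))   # ~0.44
```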
clustering with deep learning: taxonomy and new methods constrained clustering: advances in algorithms, theory, and applications latent dirichlet allocation enriching word vectors with subword information locally consistent concept factorization for document clustering importance of semantic representation: dataless classification multiple clustering views from multiple uncertain experts dataless text classification with descriptive lda learning from labeled features using generalized expectation criteria improving text categorization bootstrapping via unsupervised learning understanding the difficulty of training deep feedforward neural networks improved deep embedded clustering with local structure preservation adam: a method for stochastic optimization learning with unlabeled data for text categorization using bootstrapping and feature projection techniques distributed representations of sentences and documents seed-guided topic model for document filtering and classification effective document labeling with very few seed words: a topic model approach a deep relevance model for zero-shot document filtering dataless text classification: a topic modeling approach with document manifold text classification by labeling words some methods for classification and analysis of multivariate observations distributed representations of words and phrases and their compositionality deep k-means: jointly clustering with kmeans and learning representations a nonparametric bayesian model for multiple clustering with overlapping feature views multiple non-redundant spectral clustering views glove: global vectors for word representation computing gaussian mixture models with em using equivalence constraints on dataless hierarchical text classification constrained k-means clustering with background knowledge unsupervised deep embedding for clustering analysis towards k-means-friendly spaces: simultaneous deep learning and clustering character-level convolutional networks for text classification this research was partly funded by the anr project locust and the aura project aisua. key: cord-020871-1v6dcmt3 authors: papariello, luca; bampoulidis, alexandros; lupu, mihai title: on the replicability of combining word embeddings and retrieval models date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_7 sha: doc_id: 20871 cord_uid: 1v6dcmt3 we replicate recent experiments attempting to demonstrate an attractive hypothesis about the use of the fisher kernel framework and mixture models for aggregating word embeddings towards document representations and the use of these representations in document classification, clustering, and retrieval. specifically, the hypothesis was that the use of a mixture model of von mises-fisher (vmf) distributions instead of gaussian distributions would be beneficial because of the focus on cosine distances of both vmf and the vector space model traditionally used in information retrieval. previous experiments had validated this hypothesis. our replication was not able to validate it, despite a large parameter scan space. the last 5 years have seen proof that neural network-based word embedding models provide term representations that are a useful information source for a variety of tasks in natural language processing. in information retrieval (ir), "traditional" models remain a high baseline to beat, particularly when considering efficiency in addition to effectiveness [6] . 
combining the word embedding models with the traditional ir models is therefore very attractive and several papers have attempted to improve the baseline by adding in, in a more or less ad-hoc fashion, word-embedding information. onal et al. [10] summarized the various developments of the last half-decade in the field of neural ir and group the methods in two categories: aggregate and learn. the first one, also known as compositional distributional semantics, starts from term representations and uses some function to combine them into a document representation (a simple example is a weighted sum). the second method uses the word embedding as a first layer of another neural network to output a document representation. the advantage of the first type of methods is that they often distill down to a linear combination (perhaps via a kernel), from which an explanation about the representation of the document is easier to induce than from the neural network layers built on top of a word embedding. recently, the issue of explainability in ir and recommendation is generating a renewed interest [15] . in this sense, zhang et al. [14] introduced a new model for combining highdimensional vectors, using a mixture model of von mises-fisher (vmf) instead of gaussian distributions previously suggested by clinchant and perronnin [3] . this is an attractive hypothesis because the gaussian mixture model (gmm) works on euclidean distance, while the mixture of von mises-fisher (movmf) model works on cosine distances-the typical distance function in ir. in the following sections, we set up to replicate the experiments described by zhang et al. [14] . they are grouped in three sets: classification, clustering, and information retrieval, and compare "standard" embedding methods with the novel movmf representation. in general, we follow the experimental setup of the original paper and, for lack of space, we do not repeat here many details, if they are clearly explained there. all experiments are conducted on publicly available datasets and are briefly described here below. classification. two subsets of the movie review dataset: (i) the subjectivity dataset (subj) [11] ; and (ii) the sentence polarity dataset (sent) [12] . clustering. the 20 newsgroups dataset 1 was used in the original paper, but the concrete version was not specified. we selected the "bydate" version, because it is, according to its creators, the most commonly used in the literature. it is also the version directly load-able in scikit-learn 2 , making it therefore more likely that the authors had used this version. retrieval. the trec robust04 collection [13] . the methods used to generate vectors for terms and documents are: tf-idf. the basic term frequency -inverse document frequency method [5] . implemented in the scikit-learn library 3 . [4] . lda. latent dirichlet allocation [2] . cbow. word2vec [9] in the continuous bag-of-word (cbow) architecture. pv-dbow/dm. paragraph vector (pv) is a document embedding algorithm that builds on word2vec. we use here both its implementations: distributed bag-of-words (pv-dbow) and distributed memory (pv-dm) [7] . the lsi, lda, cbow, and pv implementations are available in the gensim library 4 . the fk framework offers the option to aggregate word embeddings to obtain fixed-length representations of documents. we use fisher vectors (fv) based on (i) a gaussian mixture model (fv-gmm) and (ii) a mixture of von mises-fisher distributions (fv-movmf) [1] . 
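the non-fk document representations listed above are all available off the shelf; the sketch below shows one plausible gensim instantiation on a toy corpus before returning to the fisher-kernel aggregation. only the 50-dimensional setting comes from the paper; all other parameters are gensim defaults and should be treated as assumptions, and the argument name for the vector dimensionality differs between gensim versions.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, LsiModel, Word2Vec, Doc2Vec
from gensim.models.doc2vec import TaggedDocument

docs = [["movie", "was", "surprisingly", "good"],
        ["plot", "felt", "slow", "and", "predictable"],
        ["great", "acting", "weak", "plot"]]

dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]

lsi = LsiModel(bow, id2word=dictionary, num_topics=50)   # 50-dimensional lsi space
lda = LdaModel(bow, id2word=dictionary, num_topics=50)

# cbow word2vec (sg=0); the dimensionality argument is named `size` in older gensim 3.x releases
w2v = Word2Vec(docs, vector_size=50, sg=0, window=5, min_count=1)

tagged = [TaggedDocument(doc, [i]) for i, doc in enumerate(docs)]
pv_dbow = Doc2Vec(tagged, vector_size=50, dm=0, min_count=1)   # pv-dbow
pv_dm = Doc2Vec(tagged, vector_size=50, dm=1, min_count=1)     # pv-dm

print(len(lsi[bow[0]]))     # lsi representation of the first document
print(w2v.wv["plot"][:5])   # first components of one learned word vector
```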
we first fit (i) a gmm and (ii) a movmf model on previously learnt continuous word embeddings. the fixed-length representation of a document x containing t words w i -expressed as where k is the number of mixture components. the vectors g x i , having the dimension (d) of the word vectors e wi , are explicitly given by [3, 14] : where ω i are the mixture weights, γ t (i) = p(i|x t ) is the soft assignment of x t to (i) gaussian and (ii) vmf distribution i, and σ 2 i = diag(σ i ), with σ i the covariance matrix of gaussian i. in (i), σ i refers to the mean vector; in (ii) it indicates the mean direction and κ i is the concentration parameter. we implement the fk-based algorithms by ourselves, with the help of the scikit-learn library for fitting a mixture of gaussian models and of the spherecluster package 5 for fitting a mixture of von mises-fisher distributions to our data. the implementation details of each algorithm are described in what follows. each of the following experiments is conceptually divided in three phases. first, text processing (e.g. tokenisation); second, creating a fixed-length vector representation for every document; finally, the third phase is determined by the goal to be achieved, i.e. classification, clustering, and retrieval. for the first phase the same pre-processing is applied to all datasets. in the original paper, this phase was only briefly described as tokenisation and stopword removal. it is not given what tokeniser, linguistic filters (stemming, lemmatisation, etc.), or stop word list were used. knowing that the gensim library was used, we took all standard parameters (see provided code 6 ). gensim however does not come with a pre-defined stopword list, and therefore, based on our own experience, we used the one provided in the nltk library 7 for english. for the second phase, transforming terms and documents to vectors, zhang et al. [14] specify that all trained models are 50 dimensional. we have additionally experimented with dimensionality 20 (used by clinchant and perronnin [3] for clustering) and 100, as we hypothesized that 50 might be too low. the tf-idf model is 5000 dimensional (i.e. only the top 5000 terms based on their tf-idf value are used), while the fischer-kernel models are 15 × d dimensional, where d = {20, 50, 100}, as just explained. in what follows, d refers to the dimensionality of lsi, lda, cbow, and pv models. the cbow and pv models are trained using a default window size of 5, keeping both low and high-frequency terms, again following the setup of the original experiment. the lda model is trained using a chunk size of 1000 documents and for a number of iterations over the corpus ranging from 20 to 100. for the fk methods, both fitting procedures (gmm and movmf) are independently initialised 10 times and the best fitting model is kept. for the third phase, parameters are explained in the following sections. logistic regression is used for classification in zhang et al., and therefore also used here. the results of our experiments, for d = 50 and 100-dimensional feature vectors, are summarised in table 1 . for all the methods, we perform a parameter scan of the (inverse) regularisation strength of the logistic regression classifier, as shown in fig. 1(a) and (b) . additionally, the learning algorithms are trained for a different number of epochs and the resulting classification accuracy assessed, cf. fig. 1(c) and (d). 
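the fisher-vector equations themselves did not survive the extraction above, so the sketch below only illustrates the general shape of a fisher-vector aggregation over a gaussian mixture fitted on word embeddings: per-component mean-gradient statistics weighted by the soft assignments γ_t(i) and concatenated into a fixed-length document vector. the 1/(t·sqrt(ω_i)) scaling follows the common gmm fisher-vector form and is an assumption rather than a transcription of the replicated papers' exact formulas; the 15 × d dimensionality quoted above suggests 15 mixture components, while the toy sketch uses 3.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
d, k = 10, 3                       # embedding dimension and number of mixture components
word_vectors = rng.randn(500, d)   # stand-in for the learned word embeddings

gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=0)
gmm.fit(word_vectors)

def fisher_vector_gmm(doc_word_vecs, gmm):
    """concatenate per-component mean-gradient statistics of a document's word vectors."""
    t = len(doc_word_vecs)
    gamma = gmm.predict_proba(doc_word_vecs)   # soft assignments gamma_t(i)
    sigma = np.sqrt(gmm.covariances_)          # diagonal standard deviations
    parts = []
    for i in range(gmm.n_components):
        diff = (doc_word_vecs - gmm.means_[i]) / sigma[i]
        g_i = (gamma[:, i, None] * diff).sum(axis=0)
        g_i /= t * np.sqrt(gmm.weights_[i])    # common 1/(t * sqrt(w_i)) scaling (assumption)
        parts.append(g_i)
    return np.concatenate(parts)               # k * d dimensional document representation

doc = rng.randn(20, d)                         # word vectors of one document
print(fisher_vector_gmm(doc, gmm).shape)       # (30,)
```

the fv-movmf variant follows the same aggregation pattern with a mixture of von mises-fisher distributions (e.g. fitted with the spherecluster package) in place of the gmm.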
figure 1 (a) indicates that cbow, fv-gmm, fv-movmf, and the simple tf-idf, when properly tuned, exhibit a very similar accuracy on subj -the given confidence intervals do not indeed allow us to identify a single, best model. surprisingly, tf-idf outperforms all the others on the sent dataset ( fig. 1(b) ). increasing the dimensionality of the feature vectors, from d = 50 to 100, has the effect of reducing the gap between tf-idf and the rest of the models on the sent dataset (see table 1 ). for clustering experiments, the obtained feature vectors are passed to the kmeans algorithm. the results of our experiments, measured in terms of adjusted rand index (ari) and normalized mutual information (nmi), are summarised in table 2 . we used both d = 20 and 50-dimensional feature vectors. note that the evaluation of the clustering algorithms is based on the knowledge of the ground truth class assignments, available in the 20 newsgroups dataset. as opposed to classification, clustering experiments show a generous imbalance in performance and firmly speak in favour of pv-dbow. interestingly, tf-idf, fv-gmm, and fv-movmf, all providing high-dimensional document representations, have a low clustering effectiveness. lsi and lda achieve low accuracy (see table 1 ) and are omitted here for visibility. the left panels [(a) and (b)] show the effect of (inverse) regularisation of the logistic regression classifier on the accuracy, while the right panels [(c) and (d)] display the effect of training for the learning algorithms. the two symbols on the right axis in panels (a) and (b) indicate the best (fv-movmf) results reported in [14] . for these experiments, we extracted from every document of the test collection all the raw text, and preprocessed it as described in the beginning of this section. the documents were indexed and retrieved for bm25 with the lucene 8.2 search engine. we experimented with three topic processing ways: (1) title only, (2) description only, and (3) title and description. the third way produces the best results and closest to the ones reported by zhang et al. [14] , and hence are the only ones reported here. an important aspect of bm25 is the fact that the variation of its parameters k 1 and b could bring significant improvement in performance, as reported by lipani et al. [8] . therefore, we performed a parameter scan for k 1 ∈ [0, 3] and b ∈ [0, 1] with a 0.05 step size for both parameters. for every trec topic, the scores of the top 1000 documents retrieved from bm25 were normalised to [0,1] with the min-max normalisation method, and were used in calculating the scores of the documents for the combined models [14] . the original results, those of our replication experiments with standard (k 1 = 1.2 and b = 0.75) and best bm25 parameter values-measured in terms of mean average precision (map) and precision at 20 (p@20)-are outlined in table 3 . we replicated previously reported experiments that presented evidence that a new mixture model, based on von mises-fisher distributions, outperformed a series of other models in three tasks (classification, clustering, and retrievalwhen combined with standard retrieval models). since the source code was not released in the original paper, important implementation and formulation details were omitted, and the authors never replied to our request for information, a significant effort has been devoted to reverse engineer the experiments. 
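the retrieval setup described above combines normalised bm25 scores with scores from the embedding-based models; since the exact combination rule is not given in the extracted text, the sketch below shows min-max normalisation of the bm25 scores and a simple weighted interpolation with a second score as one plausible reading, together with the (k1, b) grid used for the parameter scan. all scores are toy values.

```python
import numpy as np

def min_max(scores):
    """normalise a score vector to [0, 1]; constant vectors map to zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return np.zeros_like(scores) if span == 0 else (scores - scores.min()) / span

def combine(bm25_scores, model_scores, weight=0.5):
    """weighted interpolation of normalised bm25 and embedding-model scores (an assumption)."""
    return weight * min_max(bm25_scores) + (1 - weight) * min_max(model_scores)

# toy scores for the top documents of one topic
bm25 = [12.3, 10.1, 9.8, 4.2]
fv_movmf = [0.61, 0.72, 0.15, 0.33]
combined = combine(bm25, fv_movmf, weight=0.5)
print(np.argsort(-combined))   # re-ranked document order, best first

# grid for the bm25 parameter scan described above (0.05 step on both parameters)
k1_grid = np.arange(0.0, 3.0 + 1e-9, 0.05)
b_grid = np.arange(0.0, 1.0 + 1e-9, 0.05)
print(len(k1_grid) * len(b_grid))   # 61 * 21 = 1281 (k1, b) settings
```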
in general, for none of the tasks were we able to confirm the conclusions of the previous experiments: we do not have enough evidence to conclude that fv-movmf outperforms the other methods. the situation is rather different when considering the effectiveness of these document representations for clustering purposes: we find indeed that the fv-movmf significantly underperforms, contradicting previous conclusions. in the case of retrieval, although zhang et al.'s proposed method (fv-movmf) indeed boosts bm25, it does not outperform most of the other models it was compared to. clustering on the unit hypersphere using von mises-fisher distributions latent dirichlet allocation aggregating continuous word embeddings for information retrieval indexing by latent semantic analysis distributional structure. word let's measure run time! extending the ir replicability infrastructure to include performance aspects distributed representations of sentences and documents verboseness fission for bm25 document length normalization efficient estimation of word representations in vector space neural information retrieval: at the end of the early years a sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales the trec robust retrieval track. sigir forum aggregating neural word embeddings for document representation ears 2019: the 2nd international workshop on explainable recommendation and search authors are partially supported by the h2020 safe-deed project (ga 825225). key: cord-020903-qt0ly5d0 authors: tamine, lynda; melgarejo, jesús lovón; pinel-sauvagnat, karen title: what can task teach us about query reformulations? date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_42 sha: doc_id: 20903 cord_uid: qt0ly5d0 a significant amount of prior research has been devoted to understanding query reformulations. the majority of these works rely on time-based sessions which are sequences of contiguous queries segmented using time threshold on users’ activities. however, queries are generally issued by users having in mind a particular task, and time-based sessions unfortunately fail in revealing such tasks. in this paper, we are interested in revealing in which extent time-based sessions vs. task-based sessions represent significantly different background contexts to be used in the perspective of better understanding users’ query reformulations. using insights from large-scale search logs, our findings clearly show that task is an additional relevant search unit that helps better understanding user’s query reformulation patterns and predicting the next user’s query. the findings from our analyses provide potential implications for model design of task-based search engines. query reformulation is a critical user behaviour in modern search engines and it is still addressed by a significant amount of research studies [10] [11] [12] 17, 23, 26, 33] . a salient behavioural facet that has been widely captured and analysed by those studies is query history. the latter is generally structured into "query sessions" which are sequences of queries submitted by a user while completing a search activity with a search system. in the literature review, there are many definitions of query sessions. 
the widely used definitions are the following [19, 25] : (1) a time-based session, also called physical session in [6] , is a set of consecutive queries automatically delimited using a time-out threshold on user's activities. time-gap values of 30 min and 90 min have been the most commonly used in previous research [4, 6, 9, 19] ; (2) a task-based session, also called mission in [6] , is a set of queries that are possibly neither consecutive nor within the same timebased session. the queries belong to related information needs that are driven by a goal-oriented search activity, called search task (eg., job search task). the latter could be achieved by subsets of consecutive related queries called logical sessions in [6] or subtasks in [9] . previous research [4, 7, 20, 21] showed that: (1) users have a natural multitasking behaviour by intertwining different tasks during the same time-based session; and that (2) users possibly interleave the same task at different timestamps in the same time-based session or throughout multiple time-based sessions (ie., multi-session tasks). such long-term tasks are acknowledged as being complex tasks [7, 9] . figure 1 shows a sample of 3 time-based search sessions extracted from the webis-smc-12 search corpus [6] for a single user. the sessions are manually annotated with tasks. as can be seen, 6 tasks (task 1 -task 6) are performed by the user during these 3 sessions. we can observe that all these sessions are multi-tasking, since they include queries that relate to multiple tasks (eg., session 1 is multi-tasking since it includes queries that relate to task 1, 2, 3 and 4). we can also see that task 1 and task 3 are interleaved within and across sessions (eg., task 1 is interleaved within session 1 and across session 1, 2 and 3). thus, tasks 1 and 3 are multi-session tasks. while it is well-known that time-based session detection methods fail in revealing tasks [6, 19] , most of previous research work has employed time-based sessions as the focal units of analysis for understanding query reformulations [10] [11] [12] 26, 33] . other works rather studied users' query reformulations from the task perspective through user studies [15, 17, 29] . however, the authors analysed low-scale pre-designed search tasks conducted in controlled laboratory settings. in addition to their limited ability to observe natural search behaviour, there is a clear lack of comparability in search tasks across those studies. to design support processes for task-based search systems, we argue that we need to: (1) fully understand how user's task performed in natural settings drives the query reformulations changes; and (2) gauge the level of similarity of these changes trends with those observed in time-based sessions. our ultimate goal is to gain insights regarding the relevance of using user's tasks as the focal units of search to both understand and predict query reformulations. with this in mind, we perform large-scale log analyses of users naturally engaged in tasks to examine query reformulations from both the time-based session vs. task-based session perspectives. moreover, we show the role of the task characteristics in predicting the next user's query. our findings clearly show that task is an additional relevant search unit that helps to better understand user's query reformulation patterns and to predict the next user's query. query reformulation has been the focus of a large body of work. a high number of related taxonomies have been proposed [5, 11, 16] . 
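as a concrete illustration of the time-based segmentation described above, the following python sketch splits a single user's query log into sessions using a configurable idle-time threshold; the 30- and 90-minute values mirror the gaps cited above, while the (timestamp, query) data layout is an assumption for illustration.

```python
from datetime import timedelta

def segment_time_based_sessions(queries, gap_minutes=30):
    """Split one user's chronologically ordered query log into time-based sessions.

    `queries` is a list of (timestamp, query_string) tuples sorted by timestamp;
    a new session starts whenever the idle time between two consecutive queries
    exceeds `gap_minutes` (30 or 90 minutes in the studies cited above).
    """
    sessions, current = [], []
    gap = timedelta(minutes=gap_minutes)
    for ts, q in queries:
        if current and ts - current[-1][0] > gap:
            sessions.append(current)
            current = []
        current.append((ts, q))
    if current:
        sessions.append(current)
    return sessions

# usage sketch
# from datetime import datetime
# log = [(datetime(2006, 3, 1, 10, 0), "job search"),
#        (datetime(2006, 3, 1, 10, 5), "job search boston"),
#        (datetime(2006, 3, 1, 12, 0), "cheap flights")]
# len(segment_time_based_sessions(log, gap_minutes=30))  # -> 2 sessions
```

task-based sessions, by contrast, cannot be recovered with such a threshold alone; they require task labels or a task-detection step, which is the gap the analyses below exploit.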
to identify query reformulation patterns, most of the previous works used large-scale log analyses segmented into time-based sessions. different time gaps have been used including 10-15 min [8] , 30 min [4, 19] and 90 min [6, 9] . in a significant body of work, authors categorised the transitions made from one query to the subsequent queries through syntactic changes [11, 12, 23, 26] and query semantic changes [10, 12, 33] . syntactic changes include word substitution, removing, adding and keeping. the results highlighted that the query and its key terms evolve throughout the session regardless of the query position in the session. moreover, such strategies are more likely to cause clicks on highly ranked documents. further experiments on semantic query changes through generalisation vs. specialisation [10, 12] showed that a trend exists toward going from generalisation to specialisation. this behavioural pattern represents a standard building-box strategy while specialisation occurs early in the session. another category of work rather employed lab user studies to understand how different task characteristics impact users' query reformulations [15, 17, 18, 28, 31, 32] . the results mainly revealed that: (1) the domain knowledge of the task doer significantly impacts query term changes. for instance, wildemuth [31] found that search tactics changed while performing the task as users' domain knowledge evolved; (2) the cognitive complexity and structure of the task (eg., simple, hierarchical, parallel) has a significant effect on users' query reformulation behavior. for instance, liu et al. [17] found that specialisation in parallel tasks was significantly less frequent than in simple and hierarchical tasks. a few work [4, 22] used large-scale web search logs annotated with tasks to understand query reformulations. the findings in [4] were consistent with log-based studies [26] showing that page visits have significant influence on the vocabulary of subsequent queries. odijk et al. [22] studied the differences in users' reformulation strategies within successful vs. unsuccessful tasks. using a crowd-sourcing methodology, the authors showed that query specialisation through term adding is substantially more common in successful tasks than in unsuccessful tasks. it also appeared that actions such as formulating the same query than the previous one and reformulating completely a new query are rather relevant signals of unsuccessful tasks. we make several contributions over prior work. first, to the best of our knowledge, no previous study examined the differences in query reformulation strategies from the two perspectives of time-based sessions and task-based sessions viewed as background contexts. insights gleaned from our data analysis have implications for designing task-based search systems. second, although there has been intensive research on query reformulation, we provide a new insight into the variation of query reformulation strategies. the latter are analysed in relation with search episode size (short, medium and long) and search stage (start, middle and end ) from two different viewpoints (stream of query history and the search task progress). third, building on the characterisation of search tasks, we provide insights on how considering task features might improve a supervised predictive model of query reformulations. 
this analysis is carried out using the freely available webis-smc-12 search corpus [1, 6], extracted from the 2006 aol query log, which is a very large collection of web queries. the released corpus comprises 8800 queries. we remove the repeated successive queries that were automatically generated following a click rather than a user's reformulation. we also remove all non-alphanumeric characters from the queries and apply lowercasing. the cleaned data finally include 4734 queries submitted by 127 unique users. the query log is automatically segmented into time-based sessions using a time-gap threshold on users' activities. since there is so far no agreement about the most accurate time-out threshold for detecting session boundaries [9, 19], we consider the two widely used time-gap values between successive queries: 30 min as done in [4, 19] and 90 min as done in [6, 9]. we also use the provided manual annotations to segment the query log into task-based sessions. for simplicity, we subsequently refer to a time-based session as a "session" and to a task-based session as a "task".

table 1 presents the data collection statistics. one immediate observation is that the average number of queries in tasks (3.45) is higher than that of the sessions (e.g., 2.04 in the 30-min sessions), as reported in [9, 19]. the total percentage of multi-tasking sessions is roughly 13% (resp. 16%) of the 30-min sessions (resp. 90-min sessions). higher statistics (50%) were reported in [19]. however, we found that only 30.28% (resp. 31.27%) of the 30-min sessions (resp. 90-min sessions) include a single task that is not interleaved throughout the user's search history. thus, the remaining 70% of sessions are either multi-tasking or include interleaved tasks that reoccur in multiple sessions. similar statistics were observed in previous work (e.g., 68% in [9]). another interesting observation is that a high percentage of tasks (23.23%) are interleaved, which is roughly comparable to previous studies (e.g., 17% in [14]), or span multiple sessions (e.g., 27.09% of tasks span multiple 30-min sessions).

to study query reformulations, we consider the three usual categories of syntactic changes [11, 13, 26] between successive query pairs (q_i, q_{i+1}), composed of the term sets s(q_i) and s(q_{i+1}) respectively: (1) query term-retention rr; (2) query term-removal rm, which acts as search generalisation [12, 13]; and (3) query term-adding ra, which acts as search specialisation [12, 13]. for each query pair, we compute the query-pair similarity sim(q_i, q_{i+1}) (the jaccard similarity) and the query reformulation features presented in table 2, both at the session and task levels (sect. 5). here, our objective is twofold: (1) we investigate how query length (i.e., # query terms) varies across the search stages within sessions and tasks of different sizes (i.e., # queries); and (2) we examine to what extent the trends of query length changes observed within tasks are similar to those observed within sessions. to make direct comparisons of trends between sessions and tasks of different sizes in a fair way, we first statistically partition the search sessions and tasks into three balanced categories (short, medium and long). to do so, we compute the cumulative distribution function (cdf) of session size values for the 30-min and the 90-min sessions, as well as the cdf of task size values in relation with the number of included queries.
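the reformulation features introduced above (table 2) can be computed along the following lines; this is an illustrative sketch rather than the authors' implementation, and the exact normalisation of the retention/removal/adding ratios (here relative to the term sets of q_i and q_{i+1}) is an assumption, since the text only names the features.

```python
def reformulation_features(q_i, q_next):
    """Compute the syntactic-change features for one query pair.

    Queries are whitespace-tokenised and treated as term sets; the function
    returns the Jaccard similarity plus the ratios of retained, removed and
    added terms (retention/removal relative to q_i, adding relative to q_next).
    """
    s_i, s_next = set(q_i.lower().split()), set(q_next.lower().split())
    union = s_i | s_next
    jaccard = len(s_i & s_next) / len(union) if union else 0.0
    retained = len(s_i & s_next) / len(s_i) if s_i else 0.0
    removed = len(s_i - s_next) / len(s_i) if s_i else 0.0
    added = len(s_next - s_i) / len(s_next) if s_next else 0.0
    return {"sim": jaccard, "rr": retained, "rm": removed, "ra": added}

# e.g. reformulation_features("cheap flights", "cheap flights boston")
# -> {'sim': 0.67, 'rr': 1.0, 'rm': 0.0, 'ra': 0.33} (rounded)
```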
then, we compute the cdf of the search stage values in relation with the query position boundary (start, middle and end ) along each size-based category of sessions vs. tasks. since short sessions and tasks only contain 1 query and consequently do not contain query reformulations, we do not distinguish between the search stages nor consider this category of sessions and tasks in the remainder of the paper. table 3 shows the statistics of the search stages (start, middle, end ) with respect to medium and long sessions and tasks. based on those categorisations, fig. 2 shows the variation of the query length limit within each category of sessions and tasks and along the different search stages. we can see two clear trends. first, queries in both longer sessions and longer tasks generally tend to contain more terms (2.60-2.87 vs. 2.41-2.51 in average). this trend remains along all the different search stages. regarding sessions, previous studies [2] have also shown similar trends in log-based data. regarding tasks, our results suggest that long tasks require to issue more search terms. one could argue that long tasks, that more likely involve complex information needs, lead users to formulate more informative queries. we also relate this observation with previous findings [2] showing that increased success is associated with longer queries, particularly in complex search tasks. second we can surprisingly see that in general, queries observed within sessions whatever their sizes, are slightly longer in average than queries issued within tasks of the same category except at the end of the search stage. by cross-linking with the cdf results presented in table 3 , we expect that this observation particularly relates to long sessions. one possible explanation is that since long sessions are more likely to be multi-tasking (eg., there are 1.57 task in average in the long 90-min sessions vs. 1.29 in the 30-min sessions), the average query length is particularly increased within sessions that include queries at late search stages of the associated tasks (middle, end ). inspired by [13] , we examine query term frequency along the search with respect to session vs. task search context. in contrast to [13] , our underlying intent here is rather to learn more about the impact of search context (ie., session vs. task) on the level of query term reuse. for a query q i belonging to session s and task t and not submitted at the beginning (ie., i > 1), we compute the frequency of each of its terms from the previous queries within the same session q s j (resp. same task q t j ), j = 1..i − 1. then, we take the maximal value t r as "maximum term repeat" for query q i if the latter contains at least one term used t r times in previous queries. figure 3a plots the average "maximum term repeat values" for all the queries within all the sessions and tasks ranged by size (short, medium and long). we can see that the term repeat trend across sessions is similar to that reported in [13] . by comparing between the term repeat trends in sessions and tasks, we clearly observe that there are less reformulated queries that do not share any identical terms with the previous queries in tasks (eg., 70% of medium tasks) in comparison to sessions (eg., 75-78% of medium sessions). interestingly, we can see that the difference is particularly higher in the case of long tasks and long sessions (33% vs. 53-54%). 
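a small sketch of the "maximum term repeat" statistic discussed above: for a query, it counts how often each of its terms occurred in the preceding queries of the same session or task and returns the largest count; whitespace tokenisation is an assumption.

```python
def maximum_term_repeat(query, previous_queries):
    """Return the 'maximum term repeat' value for a query: the highest number of
    times any of its terms occurred in the previous queries of the same session
    (or task). Returns 0 when no term is reused."""
    counts = {}
    for q in previous_queries:
        for t in q.lower().split():
            counts[t] = counts.get(t, 0) + 1
    return max((counts.get(t, 0) for t in set(query.lower().split())), default=0)

# e.g. maximum_term_repeat("used car prices", ["used car", "used car dealers"]) -> 2
```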
however, we can notice that even if the percentage of queries sharing an increased number of terms with previous queries decreases for both medium sessions and medium tasks, the difference is reversed between long sessions and long tasks. it is more likely that query terms are renewed during long tasks which could be explained by shifts in information needs related to the same driving long-term task. figure 3b shows the percentage of reformulated queries for which each reused term occurs at the first time at a given position within sequences from length 1 to 6. it appears that the sources of reused query terms in both tasks and sessions are limited to the two previous queries. more particularly, while we find terms used in the previous query in all (100%) of the reformulated queries in medium sessions and medium tasks, it is more likely to observe reformulated queries containing terms from the two previous queries in long sessions than in long tasks (71% of sessions vs. 46% of tasks). to sum up, the context used for driving query actions is limited to the two previous queries even for long sessions and tasks, with however, a lower level of term reuse in long tasks. given each query q i belonging to session s (resp. task t ), table 4 gives the query reformulation feature values (see table 2 ) for both medium (m) and long (l) sessions and tasks and are computed over: (1) the short-term context (sc), by considering the query reformulation pair observed within the same session s (resp. task t ) (q i , q i+1 ) s (resp. (q i , q i+1 ) t ), i ≥ 1; and (2) the long-term context (lc), by considering the set of successive query reformulation pairs within the same session s (resp. task t ), (q k , q k+1 ) s (resp. (q k , q k+1 ) t ), 1 ≤ k ≤ i. significance of the differences between the "within session" scenario and the "within task" scenario considering either the short-term context (sc) or the long-term context (lc) is computed using the non-paired student t-test. we can see from table 4 that for the whole set of search actions (ie., term-retention rr, termremoval rm and term-adding ra) and similarity values (ie., avg sim), most of the differences between task-based and session-based scenarios are highlighted as significant. more particularly, we can make two key observations: (1) successive queries in both medium and long tasks are significantly more similar (avg sim of 0.27 and 0.25 respectively) than they are in medium and long sessions for both time-out thresholds (avg sim of 0.20-0.23) with higher ratios of term-retention (34% vs. 25-29%); and (2) the query history along long tasks exhibits a higher topical cohesion (avg sim of 0.24) than it does in long sessions (avg sim of 0.18-0.20) with a higher ratio of term-retention (30% vs. 23-26%) and a lower ratio of term-adding (70% vs. 74-77%) for tasks. all these results are consistent with those obtained through the analysis of query term repeat (sect. 4.2). they suggest that longer tasks more likely include topically and lexically closer information needs that might drive subtasks in comparison with long sessions. unlikely, the latter might include multiple and topically different information needs that belong to distinct tasks. to better understand the changes trends along the search, we also examine (fig. 4 ) the query reformulation similarities at different stages of the search sessions vs. tasks by considering both short-term context (sc) and long-term context (lc). we can make from fig. 4 depending on the context used (session vs. 
task), different observations about how query reformulation similarity evolves along the search stages. as outlined earlier through the query length analysis (sect. 4.1), sessions might include different ongoing tasks that lead users to formulate lexically distinct queries. in contrast, tasks might include different ongoing but related subtasks. however, queries are still overall more similar (m = 0.13, sd = 0.23, avg = 0.20) across the search stages in long tasks than they are in long sessions (m = 0.11, sd = 0.17, avg = 0.16), particularly at the end of the search stage. this observation might be related to the better cohesiveness of tasks with an increased number of queries since, unlike sessions, they are goal-oriented.

through the analyses presented in the previous sections, we have shown that there are significant differences in query reformulation patterns depending on the context used (session or task) to make the observations. the results also indicate that the time threshold value used to segment the sessions has no impact on these trends. in general, the most significant differences are observed for long tasks. informed by these findings, we show in the final contribution of this paper the potential of the task features studied in sects. 4 and 5 for enhancing the performance of a query reformulation predictive model.

given a session s = {q_1, q_2, ..., q_{m-1}, q_m}, we aim to predict, for each query sequence s_k ⊂ s, s_k = {q_1, q_2, ..., q_{k-1}, q_k}, 1 < k < m, the target query q_k given the context c_{q_k} defined by the queries {q_1, q_2, ..., q_{k-1}}, where q_{k-1} is the anchor query.

evaluation protocol. as usually done in previous work on query auto-completion [13] and next query prediction [3, 24, 27], we adopt a train-test methodology. we first sort the 30-min sessions time-wise and partition them into two parts. we use the first 60 days of data for training the predictive model and the remaining 30 days for testing. we use 718 sessions (including 2418 queries), which represent 70% of the dataset, as our training set, and 300 sessions (including 998 queries), which represent 30% of the dataset, as our testing set. to enable the evaluation of the learning approach, we first produce a set of ground-truth suggestions for each test query. to do so, we follow a standard procedure [3, 13, 27]: for each session in the training and test sets, we select as the candidate set the top-20 queries q_k that follow each anchor query q_{k-1}, ranked by query frequency. to assess the contribution of the task context features in predicting the next user query, we use the baseline ranker, a competitive learning-to-rank query suggestion model that relies on contextual features [3, 27].

model training. we design a task-aware baseline ranker, which we refer to as taskranker. for training purposes, we first generate from the 718 training sessions 1395 task-based query sequences, built with respect to the task labels provided in the webis-smc-12 search corpus. we remove the task-based query sequences with only 1 query candidate. for instance, using the task labels provided in fig. 1, we build and then select from session 1 the task-based query sequences {q1, q6} and {q3, q4}, with respectively q6 and q4 as the ground-truth queries. besides, to guarantee that the candidate set includes the target query, we remove the task-based query sequences whose ground truth is not included in the associated candidate sets. after filtering, we obtain 215 cleaned task-based query sequences used for training the taskranker model.
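the candidate-set construction and filtering steps of the evaluation protocol above can be sketched as follows; the code assumes sessions are plain lists of query strings and leaves out corpus-specific cleaning, so it illustrates the procedure rather than reproducing the authors' exact pipeline.

```python
from collections import Counter, defaultdict

def build_candidate_sets(training_sessions, top_k=20):
    """For every anchor query, collect the top-k most frequent queries that
    immediately follow it anywhere in the training sessions; these serve as
    the candidate set from which the ground-truth next query must be ranked."""
    followers = defaultdict(Counter)
    for session in training_sessions:          # each session is a list of query strings
        for anchor, nxt in zip(session, session[1:]):
            followers[anchor][nxt] += 1
    return {anchor: [q for q, _ in counts.most_common(top_k)]
            for anchor, counts in followers.items()}

def keep_valid_sequences(query_sequences, candidates):
    """Drop sequences whose ground truth is not in the anchor's candidate set,
    mirroring the filtering step described above."""
    kept = []
    for seq in query_sequences:                # seq = [q_1, ..., q_{k-1}, q_k]
        anchor, target = seq[-2], seq[-1]
        if target in candidates.get(anchor, []):
            kept.append(seq)
    return kept
```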
similarly to [3, 27], we use the state-of-the-art boosted regression tree ranking algorithm lambdamart as our supervised ranker. we tune the lambdamart model with 500 decision trees across all experiments. we use 2 sets of features (30 in total): (1) 10 features related to the analyses conducted in the previous sections of the paper (sects. 4, 5). we use the user-action related features, including the ratios of term-retention (rr), term-adding (ra), term-removal (rm), and term-repeat (tr), measured using both the short-term (sc) and long-term (lc) contexts. we also use query-similarity related features (avg sim) based on the similarity of the target query q_k with the short-term context sc (the anchor query q_{k-1}) and the long-term context lc (the previous queries in c_{q_k}); (2) 20 features that are similar to those previously used for a learning-to-rank suggestion model, described in detail in [3, 27]. this set of features includes (a) pairwise and suggestion features based on target query and anchor query characteristics, including length and frequency in the dataset; and (b) contextual features that include n-gram similarity values between the suggestion and the 10 most recent queries. note that we extended the baseline ranker released by sordoni et al. [27].

baselines and evaluation metric. as baselines, we use the conventional models widely used in the literature [3, 13, 27]: the mps model and a session-based counterpart of our ranker, referred to as sessionranker, trained on session-based query sequences built analogously to the task-based ones; for instance, from session 1 in fig. 1 we build sequences with respectively q2, q3, q4, q5 and q6 as the ground-truth queries. we obtain 1700 session-based query sequences that are then cleaned similarly to the taskranker data, by removing query sequences with only 1 query candidate and those whose ground truth is not included in the associated candidate sets. finally, the sessionranker has been trained on 302 cleaned session-based query sequences. as for the taskranker, we use the same sets of features (30 in total), learned here at the session level, and we tune it using the lambdamart model. we use the mean reciprocal rank (mrr), which is the commonly used metric for evaluating next query prediction models [3, 24, 27]. the mrr performance of the taskranker and the baselines is measured using the same test subset of 150 cleaned session-based query sequences, built from the 698 session-based query sequences generated from the 300 test sessions. the task annotations of the testing set are ignored.

table 5 shows the mrr performance for the taskranker and the baselines. the taskranker achieves an improvement of +152.8% with respect to the mps model and an improvement of +10.2% with respect to the sessionranker model. the differences in mrr are statistically significant by the t-test (p < 0.01). it has been shown in previous work [3, 27] that session size has an impact on the performance of context-aware next query prediction models. thus, we report in fig. 5 separate mrr results for the medium (2 queries) and the long sessions (≥3 queries) studied in our analyses (sects. 4 and 5). as can be seen, the task-based contextual features particularly help in predicting the next query in long sessions (+14.1% in comparison to the sessionranker, p = 7 × 10^{-3}). the prediction performance for medium sessions is slightly but not significantly lower (−1.3% in comparison to the sessionranker, p = 0.65). this result could be expected from the findings of our analyses, since long sessions include queries related to 89.9% of long tasks, whose cohesive contexts enable more accurate predictions of the user's future search intent.
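for reference, the mean reciprocal rank over the test sequences can be computed as in the sketch below, where each test instance contributes the reciprocal rank of its ground-truth query in the ranked candidate list (or 0 if it is missing).

```python
def mean_reciprocal_rank(ranked_candidates, ground_truths):
    """Compute MRR for next-query prediction over parallel lists of rankings
    and ground-truth queries."""
    total = 0.0
    for ranking, truth in zip(ranked_candidates, ground_truths):
        rr = 0.0
        for rank, candidate in enumerate(ranking, start=1):
            if candidate == truth:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ground_truths) if ground_truths else 0.0

# e.g. mean_reciprocal_rank([["q_a", "q_b", "q_c"], ["q_x", "q_y"]], ["q_b", "q_y"])
# -> (1/2 + 1/2) / 2 = 0.5
```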
better understanding user's query reformulations is important for designing task completion engines. through the analysis of large-scale query logs annotated with task labels, we have revealed significant differences in the query changes trends along the search depending on the retrospective context used, either session or task. we found that queries are even longer in longer tasks with however a lower level of term reuse in tasks than in sessions. in addition, terms are particularly renewed in long tasks indicating clear shifts in information needs. using lexical similarity measures, we have also shown that the query reformulations exhibit a clearer cohesiveness within tasks than within sessions along the different search stages, with however a decreasing level of similarity. finally, we provided insights on the usefulness of task features to enhance the user's next query prediction accuracy. given the crucial lack of query logs with annotated tasks, we acknowledge that the predictive model has been trained and tested with limited amount of data. however, the features used are based on the analysis performed on a large-scale data provided in the webis corpus. thus, we believe that the trend of our results would remain reliable. there are several promising research directions for future work. firstly, evidence related to the characterization of tasks through query length variation and query reformulation similarities along the search, presented in sects. 4 and 5, may benefit research on automatic task boundary detection. in sect. 6, we showed that learning from query streams annotated with tasks helps the query suggestion process particularly for long-term tasks. it will be interesting to design a predictive model of query trails associated with subtasks, by analogy to search trails [30] . this might help users in completing complex tasks by issuing fewer queries. this would decrease the likeliness of search struggling as shown in previous work [22] . 
references:
webis corpus archive
leading people to longer queries
learning to attend, copy, and generate for session-based query suggestion
lessons from the journey: a query log analysis of within-session learning
a unified and discriminative model for query refinement
from search session detection to search mission detection
supporting complex search tasks
detecting session boundaries from web user logs
user behaviour and task characteristics: a field study of daily information behaviour
learning to rewrite queries
analyzing and evaluating query reformulation strategies in web search logs
patterns of query reformulation during web searching
learning user reformulation behavior for query auto-completion
beyond the session timeout: automatic hierarchical segmentation of search topics in query logs
human-computer interaction: the impact of users' cognitive styles on query reformulation behaviour during web searching
patterns of search: analyzing and modeling web query refinement
analysis and evaluation of query reformulations in different task types
factors that influence query reformulations and search performance in health information retrieval: a multilevel modeling approach
identifying task-based sessions in search engine query logs
characterizing users' multi-tasking behavior in web search
uncovering task based behavioral heterogeneities in online search behavior
struggling and success in web search
analysis of multiple query reformulations on the web: the interactive information retrieval context
learning to rank query suggestions for adhoc and diversity search
analysis of a very large web search engine query log
a term-based methodology for query reformulation understanding
a hierarchical recurrent encoder-decoder for generative context-aware query suggestion
on the impact of domain expertise on query formulation, relevance assessment and retrieval performance in clinical settings
a theory of the task-based information retrieval
assessing the scenic route: measuring the value of search trails in web logs
the effects of domain knowledge on search tactic formulation
examining the impact of domain and cognitive complexity on query formulation and reformulation
application of automatic topic identification on excite web search engine data logs

key: cord-020918-056bvngu authors: nchabeleng, mathibele; byamugisha, joan title: evaluating the effectiveness of the standard insights extraction pipeline for bantu languages date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_11 sha: doc_id: 20918 cord_uid: 056bvngu extracting insights from data obtained from the web in order to identify people's views and opinions on various topics is a growing practice. the standard insights extraction pipeline is typically an unsupervised machine learning task composed of processes that preprocess the text, visualize it, cluster and identify the topics and sentiment in each cluster, and then graph the network. given the increasing amount of data being generated on the internet in africa today, and the multilingual state of african countries, we evaluated how well the standard pipeline works when applied to text wholly or partially written in indigenous african languages, specifically bantu languages. we carried out an exploratory investigation using twitter data and compared the outputs from each step of the pipeline for an english dataset and a mixed bantu language dataset.
we found that for bantu languages, due to their complex grammatical structure, extra preprocessing steps such as part-of-speech tagging and morphological analysis are required during data cleaning, threshold values should be adjusted during topic modeling, and semantic analysis should be performed before completing text preprocessing. the growing penetration of mobile telephony and internet services in africa has led to an increased presence of african user-generated content, especially on social media platforms (such as facebook, twitter, and whatsapp). according to internet world stats [8] , by the end of 2018, over 460 million out of the continent's 1.3 billion people used the internet, and there were over 200 million facebook subscribers at the end of 2017. this represents a 35.2% internet penetration rate and a 15.5% facebook penetration rate [8] . the user-generated content has been leveraged to obtain insights about elections [23] , design marketing strategies [1] , and monitor the aftermath of epidemics [19] . however, only the content that is written in languages with high-quality linguistic resources such as english, french, portuguese, and arabic are used for such analyses and content generated in indigenous african languages is largely excluded. it has been found that even though the amount of content generated in indigenous african languages is significantly lower than non-indigenous language content, it nonetheless contains valuable insights, especially relevant to the local context [12] . hence, it is extremely important that we develop resources and tools that can be used to parse out useful information from free-text written in any language. in this paper, we investigated whether the standard insights extraction pipeline is sufficient when applied to a single language family indigenous to africa, bantu languages, using the following questions: (1) how well does the standard insights extraction pipeline apply to bantu languages; and (2) if found to be inadequate, why, and how can the pipeline be modified so as to be applicable to bantu languages? two datasets of 20,000 tweets each were included in the study: one was comprised solely of english text and the other a mixed batch of six bantu languages and english text. both datasets were analysed using a seven-step pipeline: (1) text preprocessing and normalization, (2) dimensionality reduction, (3) visualization, (4) clustering, (5) topic modeling, (6) sentiment analysis, and (7) network graphing; and the differences in outcomes were measured. we found that: (1) there is a need to differentiate between conjunctively and disjunctively written languages; (2) sentiment analysis should be performed before verb stemming during text preprocessing, before any present negation morpheme is removed; (3) during text preprocessing and normalization, stemming verbs and adjectives is crucial to avoiding very high levels of sparsity in the representation matrix; (4) stemming nouns must be avoided so as to prevent the loss of important semantic information; and (5) during topic modeling, some threshold values must be adjusted to account for agglutination. this evaluation has, to the best of our knowledge, never been done for bantu languages. the rest of the paper is arranged as follows: in sect. 2, a brief background on bantu languages and their grammatical structure is presented; sect. 
3 presents related work on extracting insights using the standard pipeline; and the methods, investigation, and results of the evaluation are presented in sect. 4. the implications of our findings are discussed in sect. 5, and we conclude in sect. 6. bantu languages are indigenous to africa, geographically extending from the south, below nigeria, to most of central, east, and southern africa, they are found in 27 of the continent's 54 countries, and range in number from 300 to 680 [21] . bantu languages have an agglutinating morphology, where words consist of several morphemes, and each affix agglutinated with the root word carries meaning such as tense and aspect [21] . the writing system of bantu languages is either conjunctive or disjunctive [25] . in the former case, several orthographic words, 'i love them', are written as a single word, for example, mbakunda in runyankore (a language indigenous to uganda). the latter case writes different orthographic words as separate words. for example, the same translation for 'i love them' is kea ba rata in sepedi (a language indigenous to south africa). the hallmark of bantu nominal morphology is the noun class (nc), where all nouns are assigned to a class; and there are over 20 ncs, although some have fallen into disuse in most languages [17, 21] . a simple noun comprises a prefix and a stem [11] ; for example, omuntu, 'person' in runyankore, can be analyzed as the prefix o-mu-and stem -ntu. however, not all bantu languages have the initial vowel on the prefix [11, 17] ; for example, 'person' in sepedi is motho, with prefix mo-and stem -tho. noun classes are also at the heart of an extensive system of concordial agreement that governs grammatical agreement in verbs, adjectives, possessives, subject, object, etc. [11, 25] ; this is a pivotal constituent of the whole bantu sentence structure [25] . the morphological and phonological structure of bantu verbs is very regular in most languages [20, 24] , with a typical verbal form consisting of: one or more bound morphemes, a verb-root, and one or more extensions [24] . the morphemes preceding the verb-root specify the person, noun class, tense, aspect, time, negation, etc., while the extensions specify valency-changing categories,the arguments controlled by a verb-which can be as many as eleven [24] . additionally, bantu languages typically have a large number of tenses, with up to four observed past tenses, and up to three observed discrete future tenses [20] . this complex grammatical structure is partly what has led to bantu languages being largely computationally under-resourced, despite still primarily being a first language throughout the continent. in the next section, we present related work on extracting insights from collections of data. social media data mining has become a common tool used to extract opinions from a large population in order to monitor, understand, and predict people's reactions to an event, and to measure the diffusion of ideas within the social network [15] . in this section, related work on extracting insights from collections of documents is presented. the scope here is limited to social media data, specifically textual data, and more so, twitter data, because the vast amount of content generated and shared through social media contains rich knowledge and covers a wide spectrum of social dynamics [33] . in their socio-semantic analysis of twitter data, lipizzi et al. 
[15] stated that the following processes are necessary to extract complete and valuable insights from data: (1) preprocessing the text, (2) identifying and classifying opinions in the network, (3) analyzing the sentiment of individual or groups of text, (4) visualization of the large amounts of data; and (5) extracting conversational maps from social streams. we subdivided these processes into seven steps, including dimensionality reduction and clustering, shown in fig. 1 . text preprocessing is necessary because of the strong heterogeneity and noisiness characteristic of social media texts [26] . it involves dealing with incorrect spelling, contractions and abbreviations, stop words, inflectional variants, user tags, hyperlinks, numbers, and email addresses [26, 32] . however, the steps taken during preprocessing depend on the quality, quantity, and style of the underlying text [32] . data visualization requires that the large amount of data be compacted in an expressive fashion [15] . because text documents are high dimensional objects, effectively visualizing such data requires it to be projected to a lower-dimensional space [18] . thus, dimensionality reduction, which involves transforming highdimensional data into a meaningful representation of reduced dimensionality, is an essential part of text mining [6, 31] . however, for the lower dimensional representation to be meaningful, it must be a good approximation of the original document set given in its full space [6] . the commonly used techniques for dimensionality reduction are: principal component analysis (pca), which both minimizes information loss and increases interpretability [6, 10] ; singular value decomposition (svd), a stable and robust technique [9] ; and t-distributed stochastic neighbor embedding (t-sne), which is particularly well suited for the visualization of high dimensional datasets [30] . document clustering aims to efficiently organize, navigate, summarize, and retrieve documents [3] . it can either be done using partitioning algorithms, where the number of clusters is specified before clustering takes place (for example, k-means [2, 4] , locally adaptive clustering [3] ), and non-negative matrix factorization [13, 14, 29] ; or hierarchical algorithms, which start by either considering each document as a cluster (agglomerative clustering) or all documents as belonging to a single cluster (divisive clustering) [2, 35] . in the former case, documents are continually assigned to the nearest cluster until no further improvement is achieved, while the latter either decreases or increases the number of clusters until a stopping condition is met [2, 35] . topic modeling is the application of probabilistic models to uncover the underlying semantic structure of a collection of documents, where each topic is defined as a distribution over a set of words [2, 34] . there are several topic modeling algorithms, but the most commonly applied are latent dirichlet allocation (lda) [2, 4, 26, 33, 34] and non-negative matrix factorization (nmf) [7, 13, 14, 22, 29] . sentiment analysis involves the computational study of people's opinions, appraisals, attitudes, and emotions about events, entities, individuals, and topics [16, 27] . features found to be important during sentiment analysis include: terms and their frequency, adjectives, negation, and opinion words and phrases [16] . 
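the role of the opinion words and negation listed above can be illustrated with a toy, unsupervised scorer; the lexicon and negation list below are tiny hypothetical examples, not resources used in the works cited.

```python
NEGATION_TOKENS = {"not", "no", "never"}                          # toy english list
OPINION_LEXICON = {"good": 1, "great": 1, "bad": -1, "poor": -1}  # hypothetical lexicon

def lexicon_sentiment(text, window=3):
    """Score a text by summing opinion-word polarities, flipping the sign when a
    negation token appears within `window` positions before the opinion word."""
    tokens = text.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        polarity = OPINION_LEXICON.get(tok)
        if polarity is None:
            continue
        negated = any(t in NEGATION_TOKENS for t in tokens[max(0, i - window):i])
        score += -polarity if negated else polarity
    return score

# lexicon_sentiment("the camera is not good") -> -1
```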
sentiment analysis can be formulated either as a supervised learning problem that can be solved using well-known classification algorithms such as naive bayes or support vector machines [16, 27] , or as an unsupervised problem where opinion words and phrases are used as the dominating indicators of sentiment [16] . network graphing is used to provide structure to the information exchanged in a social network, and has mostly been used to identify influential users on a topic for marketing or advertising services [4] . here, each user in a social network is considered as a node in a graph, and the relationships between users (follow, retweet, like, etc.) as directed edges between nodes in the graph [4] . the above processes have been found to be sufficient to extract insights from text in other languages beyond english, such as french [26] , chinese [34] , and arabic [2] . however, to the best of our knowledge, no work has been done completely to apply the described pipeline to bantu languages. here, the methodology and results of evaluating the suitability of the standard pipeline for use with bantu languages are presented. we used two datasets in this evaluation, each comprising 20,000 tweets; the first, an english dataset composed of customer reviews 1 ; the second, composed of tweets in english, mixed code, and six bantu languages, was archived directly from live south african and ugandan tweets covering the period february 2019 to may 2019. the live tweets were archived based on the trending hashtags during the period of data collection. the six bantu languages targeted were isizulu, luganda, runyankore, sepedi, sesotho, and setswana. these languages were selected because they cover both conjunctive and disjunctive writing styles, and they are understood by the authors. however, due to the use of the mixed code writing style, we found tweets that contained terms in other bantu languages beyond the six considered. our investigation was limited to twitter data due to the inherent difficulty of performing opinion mining on it, resulting from the informal writing style used and limited tweet length. we hypothesize that the findings based on twitter data are generalizable to other social media platforms. we further limited the size of each dataset to 20,000 tweets, as the results based on a limited dataset are also generalizable to a larger dataset. both datasets were run through the seven processes in the standard pipeline, and analyzed for any significant differences. for text preprocessing, we used the same techniques as described in [26, 32] . however, no stemming/lemmatizing was performed on either dataset because, to the best of our knowledge, two of the bantu languages (luganda and runyankore) do not have tools for this 2 . we used multiple approaches for dimensionality reduction (pca, and a combination of svd followed by t-sne), clustering (k-means and nmf) and topic modeling (lda and nmf), in order to consider the approach which gives the better result. gephi 3 was used to graph the network. at each step in the pipeline, the results between the two datasets were compared, with emphasis placed on any observed differences, significant or otherwise. where a process in the pipeline was found to be insufficient to process the bantu language dataset, we then investigated if and how the complex grammatical structure of these languages causes the observed limitations. we further investigated what needs to be done in order to adapt that process to fulfill the same task for bantu languages. 
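a compact sketch of the unsupervised pipeline configuration described above (tf-idf features, svd followed by t-sne for visualisation, k-means clustering, and nmf topic modeling) is given below using scikit-learn; the regex cleaning step is deliberately simplistic and english-oriented, and the parameter choices (50 svd components, min_df of 3, and so on) are illustrative assumptions rather than the settings used in the evaluation.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def clean_tweet(text):
    """Light normalisation: lowercase, strip URLs, handles, hashtags, digits, punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def run_pipeline(tweets, n_clusters=10, n_topics=10, n_top_terms=8):
    docs = [clean_tweet(t) for t in tweets]
    tfidf = TfidfVectorizer(stop_words="english", min_df=3)
    X = tfidf.fit_transform(docs)

    # dimensionality reduction for visualisation: SVD followed by t-SNE
    X_svd = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)
    X_2d = TSNE(n_components=2, random_state=0, init="pca").fit_transform(X_svd)

    # clustering on the reduced representation
    labels = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(X_svd)

    # topic modeling with NMF on the full TF-IDF matrix
    nmf = NMF(n_components=n_topics, init="nndsvd", max_iter=400)
    nmf.fit(X)
    vocab = tfidf.get_feature_names_out()
    topics = [[vocab[i] for i in comp.argsort()[-n_top_terms:][::-1]]
              for comp in nmf.components_]
    return X_2d, labels, topics
```

note that the character-level regex above already hints at the language sensitivity discussed in the next sections: what counts as a "clean" token is itself an english-centric decision.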
on the other hand, where a process in the pipeline was found to adequately apply to the bantu language dataset, we noted this finding and proceeded to the next step. at the end of the evaluation, the processes of text preprocessing, topic modeling, and sentiment analysis were found to require some modification in order to sufficiently extract meaningful insights from textual data in bantu languages. the processes of dimensionality reduction, data visualization, and clustering, though being language independent, were also found to be affected by the termdocument matrix, which is itself language dependent. only network graphing was found to be completely language independent. the following subsections provide details on the limitations found during text preprocessing, topic modeling, and sentiment analysis, and explain the findings based on the grammatical structure of bantu languages. text preprocessing. during text preprocessing, after converting the data to lower case, it underwent the removal of html tags, urls, numbers, email addresses, twitter handles, and hashtags; then the expansion of contractions (such as can't and we're) and abbreviations (such as lol, dm, and tbh); and finally, the elimination of non-alphanumeric characters and stop words. the text in both datasets was not stemmed or lemmatized due to the lack of such resources for some of the bantu languages considered in this investigation. for the english-only dataset, the preprocessing performed was found to be sufficient. however, we found that several additional processes are necessary to fully preprocess the mixed bantu language dataset. these processes are: distinguishing conjunctively versus disjunctively written languages, part-of-speech tagging, and stemming/lemmatizing only verbs and adjectives. distinguishing between conjunctively and disjunctively written languages. the mixed bantu language dataset comprised three conjunctively written languages (isizulu, luganda, and runyankore) and three disjunctively written languages (sepedi, sesotho, and setswana). as explained in sect. 2, bantu languages are written either conjunctively or disjunctively, and therefore, there is a need to differentiate between them in order to perform the appropriate preprocessing. taljard and bosch [25] identified that a word-class tagger is sufficient for disjunctively written languages, while a morphological analyzer is required for the conjunctively written languages. this is because the disjunctive system of writing requires bound morphemes to be written as orthographically distinct units (kea ba rata 'i love them' in sepedi), thus making morphological information explicit in the orthography [25] . on the other hand, the conjunctive writing style requires a morphological analyzer to make the different morphemes in the orthography explicit [25] , for example from mbakunda to m-ba-kunda 'i love them' in runyankore. the authors concluded that the differences in writing systems necessitate the use of different architectures specifically for part-of-speech tagging. the need for part-of-speech tagging was identified as crucial during text preprocessing and therefore, the type of writing style first needs to be identified before this can be performed. part-of-speech tagging. though neither stemming nor lemmatization were performed on both datasets during preprocessing, we nonetheless recognize the need to stem/lemmatize the verbs and adjectives because of their numerous grammatical forms. 
nouns, on the other hand, should not be stemmed as this would result in the loss of their core semantics. as explained in sect. 2, a noun is composed of a prefix and a stem. however, the stem of a noun is not unique, but rather gets its full semantics from the prefix. table 1 shows examples of tweets from the dataset where stemming the noun will result in a meaningless stem 4 (the nouns of interest are in bold font, with the prefix underlined). the examples shown in table 1 highlight the problem that can result if nouns are stemmed during text preprocessing. for sepedi, sepedi (a language) would be indistinguishable from mopedi (a member of the bapedi tribe); for setswana, setswana (a language) would have the same stem as botswana (a country); for isizulu, umuntu (person) would be reduced to the same stem as abantu (people). additionally, for isizulu, a conjunctively written language, the example also shows the need for morphological analysis, to separate the copulative ng from the noun. with the semantics of the noun removed through stemming, the resultant stems -pedi, -tswana, and -ntu are meaningless without a prefix. this in turn would affect topic modeling downstream. part-of-speech tagging is therefore required to differentiate between nouns that should not be stemmed and other parts-of-speech that should. stemming verbs and adjectives. a typical bantu language verbal form consists of one or more bound morphemes, a verb-root, and one or more extensions [24] . the bound morphemes include the subject and object, which are determined by the noun class, as is the full adjectival form [11, 25] . therefore, for a language like runyankore with 20 noun classes, there are 400 different ways of conjugating a single verb stem for subject and object. additionally, the number of extensions can be as many as nine, as shown in table 2 , where a single verb stem reeb-in runyankore and bon-in sepedi for 'see' is extended. in addition to the increasing number of verb forms owing to the extensions shown in table 2 and the noun class system, bantu languages typically have a very large number of tenses [20] . for example, runyankore has 14 tenses [28] and these too are part of the verb form. this complex grammatical structure results in a single verb root having thousands of possible verb forms. therefore, verb stemming/lemmatizing is a crucial step during preprocessing, which, if not performed results in very high levels of sparsity in the resultant matrix. 4 the translations to the text in table 1 adjectives also require stemming because the full form of an adjective depends on the noun class of the noun being described. therefore, the number of forms that a single adjective can take depend on the number of noun classes in that language. runyankore, for example, has 20 different forms for each adjectival stem because it has 20 noun classes. table 3 shows some examples of the forms that the adjective 'beautiful' in runyankore -rungi and sepedi -botse (the adjective prefix is underlined). [21] . as a result, for the conjunctive writing style, an entire sentence can be represented as a single word. consider the following runyankore example from [28] : titukakimureeterahoganu, meaning 'we have never ever brought it to him', and comprises the morphemes titu-ka-ki-mu-reet-er-a-ho-ga-nu. for this reason, all tweets, despite their length, were included in the topic modeling of the mixed bantu language dataset. we also included all tokens in the mixed bantu language dataset during topic modeling. 
although, this is contrary to the recommended minimum token count of three, it was done because, as explained in sect. 4.2, a single verb stem can be inflected into thousands of verb forms, and it should therefore be expected that, without performing verb stemming, such tokens will be extremely rare in the dataset. from measuring the number of tokens below the recommended threshold count of three in both datasets, we found that 72.04% of tokens in the mixed bantu language dataset were below this threshold, compared to 0.00% in the english dataset. this is a significant result, again pointing to the importance of verb stemming during preprocessing. conversely, the english dataset, which was not stemmed either, does not show such an adverse need for it. there are currently no publicly available sentiment analysis implementations for any of the bantu languages used during this investigation. however, we assessed the currently available tools to evaluate whether sentiment analysis could be done following the standard pipeline. in sect. 3, four features were identified as important for sentiment analysis; three of these (terms and their frequency, adjectives, and opinion words and phrases) are also applicable to a bantu language dataset. however, if verb stemming is performed during text preprocessing (as we recommend in sect. 4.2), negation will present a differentiating factor for bantu languages. this is because, for conjunctively written languages, the negation morpheme(s) is agglutinated to the verb stem, while for disjunctively written languages, the negation morpheme is not necessarily only used in the context of negation. consider the excerpts from the dataset shown in table 4 5 writing styles (the negation morphemes are underlined and the verb roots are in bold font). in the standard pipeline, sentiment analysis is performed after text preprocessing, visualization, clustering, and topic modeling, in order to assess the sentiment associated within a specific cluster or topic. however, for bantu languages, once verb stemming is performed during text preprocessing, then the verbs in table 4 are reduced to their roots (shown in bold font); thus losing the negation morphemes ti, ta, ga, and e. further complexities during sentiment analysis arise from:(1) multiple rules regarding negation, and (2) the negation morpheme being applicable to other parts of speech other than negation. for the former case, consider the example of runyankore, where ti is the primary negative and ta the secondary negative; sepedi, in addition to the negation morpheme ga, encodes negation in the change of the final vowel from a to e. losing such morphemes would in turn skew the results on sentiment analysis further down the pipeline. from the findings presented in sect. 4, we have shown that bantu languages require a different architecture from the 'standard'. we therefore propose an alternative architecture shown in fig. 2 . the following are the areas where differences arise (note that the other processes maintain their original placement in fig. 1 ): (1) during text preprocessing, identifying the writing style of a language is done first, to determine whether to perform part-of-speech tagging for disjunctively written languages or morphological analysis for conjunctively written languages. 
(2) next, part-of-speech tagging and morphological analysis are performed to prevent nouns from being stemmed, thus avoiding the loss of their semantics encoded in the noun prefix, and ensure that verbs and adjectives are stemmed in order to avoid noise in the data and high levels of sparsity in the resultant matrix. (3) sentiment analysis is performed during text preprocessing, before any negation morphemes are lost during verb stemming. further, it is performed after part-of-speech tagging and morphological analysis, in order to avoid the ambiguity of the negation morpheme identified for some disjunctively written languages, while also making the negation morpheme explicit for conjunctively written languages. (4) finally, during topic modeling, without stemming verbs and adjectives, the threshold counts should not be applied because a significant amount of the dataset will be excluded. in this paper, the standard insights extraction pipeline was evaluated for how well it applies to a grammatically complex and under-resourced family of languages, bantu languages. seven processes were identified as belonging to the standard pipeline (text preprocessing, dimensionality reduction, visualization, clustering, topic modeling, sentiment analysis, and network graphing) and tested for their effectiveness on two datasets of 20,000 tweets each, one composed of english and the other a mixture of english and six bantu languages. results showed that: conjunctively written languages should be distinguished from disjunctively written languages, because they require different preprocessing steps; verbs and adjectives, but not nouns, should be stemmed; threshold counts should be revised during topic modeling; and sentiment analysis should be done before verb stemming, in order to prevent the loss of the negation morpheme. future work will include implementing these recommendations and assessing their effectiveness. 
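a minimal skeleton of the reordered preprocessing proposed above might look as follows; every language-specific component (writing-style flag, pos tagger, morphological analyser, sentiment scorer, stemmer) is a hypothetical placeholder, since the point of the sketch is only to show where each step fits relative to the others.

```python
def basic_normalise(tweet):
    """Placeholder for the generic cleaning steps (lowercasing, URL/handle removal)."""
    return tweet.lower().split()

def modified_preprocess(tweet, lang_tools):
    """Skeleton of the reordered preprocessing for Bantu text.

    `lang_tools` is a hypothetical bundle of language resources exposing
    `is_conjunctive`, `pos_tag`, `morph_analyse`, `sentiment`, and `stem`;
    both analysers are assumed to return (token, tag) pairs.
    """
    tokens = basic_normalise(tweet)

    # 1. identify the writing style first
    if lang_tools.is_conjunctive:
        analysed = lang_tools.morph_analyse(tokens)   # split agglutinated morphemes
    else:
        analysed = lang_tools.pos_tag(tokens)         # word-class tagging is sufficient

    # 2. sentiment before stemming, while negation morphemes are still visible
    polarity = lang_tools.sentiment(analysed)

    # 3. stem only verbs and adjectives; leave noun prefixes (and their semantics) intact
    processed = [lang_tools.stem(tok) if tag in ("VERB", "ADJ") else tok
                 for tok, tag in analysed]
    return processed, polarity
```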
(table 4 excerpts, with translations: '... but honestly, who does not see this'; 'ka sepedi ga re berekise c q z le x', 'in sepedi, we don't use the letters c, q, z, and x'; 'ga re buwe sesotho mo limpopo', 'we don't speak sesotho in limpopo'.)

references:
social media marketing: the case of africa
revisiting k-means and topic modeling: a comparison study to cluster arabic documents
text clustering with local semantic kernel
social network analysis using topic models
topical clustering of tweets
cluster preserving dimension reduction methods for document classification
non-negative matrix factorization with sparsity constraints
internet world stats: africa internet user stats in 2019 population by country
a users guide to principal components analysis
principal component analysis: a review and recent developments
bantu nominal morphology
promoting content in africa
sparse nonnegative matrix factorization for clustering
nonnegative matrix factorization for interactive topic modeling and document clustering
extracting and evaluating conversational patterns in social media: a socio-semantic analysis of customers' reactions to the launch of new products using twitter streams
a survey of opinion mining and sentiment analysis
a comparative study of bantu noun classes
dimensionality reduction for text using domain knowledge
information circulation in times of ebola: twitter and the sexual transmission of ebola by survivors
aspect and tense in bantu languages
introduction
sparse non-negative matrix factorization using l0-constraints
derivation
a comparison of approaches to word class tagging: conjunctively versus disjunctively written bantu languages
what patients can tell us: topic analysis for social media on breast cancer
approaches for sentiment analysis on twitter: a state-of-art study
tense and aspect in runyankore-rukiga: linguistic resources and analysis. master's thesis
a review of non-negative matrix factorization methods for clustering
high-dimensional data using t-sne
dimensionality reduction: a comparative review. tilburg centre for creative computing
computer assisted text analysis for social science: topic models and beyond. computation, and language
a survey on visual analytics of social media data
a comparative study of topic models for topic clustering of chinese web news
survey of clustering algorithms

key: cord-020801-3sbicp3v authors: macavaney, sean; soldaini, luca; goharian, nazli title: teaching a new dog old tricks: resurrecting multilingual retrieval using zero-shot learning date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_31 sha: doc_id: 20801 cord_uid: 3sbicp3v while billions of non-english speaking users rely on search engines every day, the problem of ad-hoc information retrieval is rarely studied for non-english languages. this is primarily due to a lack of data sets that are suitable for training ranking algorithms. in this paper, we tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on english collections to non-english queries and documents. our models are evaluated in a zero-shot setting, meaning that we use them to predict relevance scores for query-document pairs in languages never seen during training. our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for arabic, chinese mandarin, and spanish. we also show that augmenting the english training collection with some examples from the target language can sometimes improve performance.
every day, billions of non-english speaking users [22] interact with search engines; however, commercial retrieval systems have been traditionally tailored to english queries, causing an information access divide between those who can and those who cannot speak this language [39] . non-english search applications have been equally under-studied by most information retrieval researchers. historically, ad-hoc retrieval systems have been primarily designed, trained, and evaluated on english corpora (e.g., [1, 5, 6, 23] ). more recently, a new wave of supervised state-of-the-art ranking models have been proposed by researchers [11, 14, 21, 24, 26, 35, 37] ; these models rely on neural architectures to rerank the head of search results retrieved using a traditional unsupervised ranking algorithm, such as bm25. like previous ad-hoc ranking algorithms, these methods are almost exclusively trained and evaluated on english queries and documents. the absence of rankers designed to operate on languages other than english can largely be attributed to a lack of suitable publicly available data sets. this aspect particularly limits supervised ranking methods, as they require samples for training and validation. for english, previous research relied on english collections such as trec robust 2004 [32] , the 2009-2014 trec web track [7] , and ms marco [2] . no datasets of similar size exist for other languages. while most of recent approaches have focused on ad hoc retrieval for english, some researchers have studied the problem of cross-lingual information retrieval. under this setting, document collections are typically in english, while queries get translated to several languages; sometimes, the opposite setup is used. throughout the years, several cross lingual tracks were included as part of trec. trec 6, 7, 8 [4] offer queries in english, german, dutch, spanish, french, and italian. for all three years, the document collection was kept in english. clef also hosted multiple cross-lingual ad-hoc retrieval tasks from 2000 to 2009 [3] . early systems for these tasks leveraged dictionary and statistical translation approaches, as well as other indexing optimizations [27] . more recently, approaches that rely on cross-lingual semantic representations (such as multilingual word embeddings) have been explored. for example, vulić and moens [34] proposed bwesg, an algorithm to learn word embeddings on aligned documents that can be used to calculate document-query similarity. sasaki et al. [28] leveraged a data set of wikipedia pages in 25 languages to train a learning to rank algorithm for japanese-english and swahili-english cross-language retrieval. litschko et al. [20] proposed an unsupervised framework that relies on aligned word embeddings. ultimately, while related, these approaches are only beneficial to users who can understand documents in two or more languages instead of directly tackling non-english document retrieval. a few monolingual ad-hoc data sets exist, but most are too small to train a supervised ranking method. for example, trec produced several non-english test collections: spanish [12] , chinese mandarin [31] , and arabic [25] . other languages were explored, but the document collections are no longer available. the clef initiative includes some non-english monolingual datasets, though these are primarily focused on european languages [3] . recently, zheng et al. [40] introduced sogou-qcl, a large query log dataset in mandarin. 
such datasets are only available for languages that already have large, established search engines. inspired by the success of neural retrieval methods, this work focuses on studying the problem of monolingual ad-hoc retrieval on non english languages using supervised neural approaches. in particular, to circumvent the lack of training data, we leverage transfer learning techniques to train arabic, mandarin, and spanish retrieval models using english training data. in the past few years, transfer learning between languages has been proven to be a remarkably effective approach for low-resource multilingual tasks (e.g. [16, 17, 29, 38] ). our model leverages a pre-trained multi-language transformer model to obtain an encoding for queries and documents in different languages; at train time, this encoding is used to predict relevance of query document pairs in english. we evaluate our models in a zero-shot setting; that is, we use them to predict relevance scores for query document pairs in languages never seen during training. by leveraging a pre-trained multilingual language model, which can be easily trained from abundant aligned [19] or unaligned [8] web text, we achieve competitive retrieval performance without having to rely on language specific relevance judgements. during the peer review of this article, a preprint [30] was published with similar observations as ours. in summary, our contributions are: -we study zero shot transfer learning for ir in non-english languages. -we propose a simple yet effective technique that leverages contextualized word embedding as multilingual encoder for query and document terms. our approach outperforms several baselines on multiple non-english collections. -we show that including additional in-language training samples may help further improve ranking performance. -we release our code for pre-processing, initial retrieval, training, and evaluation of non-english datasets. 1 we hope that this encourages others to consider cross-lingual modeling implications in future work. zero-shot multi-lingual ranking. because large-scale relevance judgments are largely absent in languages other than english, we propose a new setting to evaluate learning-to-rank approaches: zero-shot cross-lingual ranking. this setting makes use of relevance data from one language that has a considerable amount of training data (e.g., english) for model training and validation, and applies the trained model to a different language for testing. more formally, let s be a collection of relevance tuples in the source language, and t be a collection of relevance judgments from another language. each relevance tuple q, d, r consists of a query, document, and relevance score, respectively. in typical evaluation environments, s is segmented into multiple splits for training (s train ) and testing (s test ), such that there is no overlap of queries between the two splits. a ranking algorithm is tuned on s train to define the ranking function r strain (q, d) ∈ r, which is subsequently tested on s test . we propose instead tuning a model on all data from the source language (i.e., training r s (·)), and testing on a collection from the second language (t ). we evaluate on monolingual newswire datasets from three languages: arabic, mandarin, and spanish. the arabic document collection contains 384k documents (ldc2001t55), and we use topics/relevance information from the 2001-02 trec multilingual track (25 and 50 topics, respectively). for mandarin, we use 130k news articles from ldc2000t52. 
mandarin topics and relevance judgments are utilized from trec 5 and 6 (26 and 28 topics, respectively). finally, the spanish collection contains 58k articles from ldc2000t51, and we use topics from trec 3 and 4 (25 topics each). we use the topics, rather than the query descriptions, in all cases except trec spanish 4, in which only descriptions are provided. the topics more closely resemble real user queries than descriptions. 2 we test on these collections because they are the only document collections available from trec at this time. 3 we index the text content of each document using a modified version of anserini with support for the languages we investigate [36] . specifically, we add anserini support for lucene's arabic and spanish light stemming and stop word list (via spanishanalyzer and arabicanalyzer). we treat each character in mandarin text as a single token. modeling. we explore the following ranking models: -unsupervised baselines. we use the anserini [36] implementation of bm25, rm3 query expansion, and the sequential dependency model (sdm) as unsupervised baselines. in the spirit of the zero-shot setting, we use the default parameters from anserini (i.e., assuming no data of the target language). -pacrr [14] models n-gram relationships in the text using learned 2d convolutions and max pooling atop a query-document similarity matrix. -knrm [35] uses learned gaussian kernel pooling functions over the querydocument similarity matrix to rank documents. -vanilla bert [21] uses the bert [10] transformer model, with a dense layer atop the classification token to compute a ranking score. to support multiple languages, we use the base-multilingual-cased pretrained weights. these weights were trained on wikipedia text from 104 languages. we use the embedding layer output from base-multilingual-cased model for pacrr and knrm. in pilot studies, we investigated using cross-lingual muse vectors [8] and the output representations from bert, but found the bert embeddings to be more effective. experimental setup. we train and validate models using trec robust 2004 collection [32] . trec robust 2004 contains 249 topics, 528k documents, and 311k relevance judgments in english (folds 1-4 from [15] for training, fold 5 for validation). thus, the model is only exposed to english text in the training and validation stages (though the embedding and contextualized language models are trained on large amounts of unlabeled text in the languages). the validation dataset is used for parameter tuning and for the selection of the optimal training epoch (via ndcg@20). we train using pairwise softmax loss with adam [18] . we evaluate the performance of the trained models by re-ranking the top 100 documents retrieved with bm25. we report map, precision@20, and ndcg@20 to gauge the overall performance of our approach, and the percentage of judged documents in the top 20 ranked documents (judged@20) to evaluate how suitable the datasets are to approaches that did not contribute to the original judgments. we present the ranking results in table 1 . we first point out that there is considerable variability in the performance of the unsupervised baselines; in some cases, rm3 and sdm outperform bm25, whereas in other cases they underperform. similarly, the pacrr and knrm neural models also vary in effectiveness, though more frequently perform much worse than bm25. this makes sense because these models capture matching characteristics that are specific to english. 
for instance, n-gram patterns captured by pacrr for english do not necessarily transfer well to languages with different constituent order, such as arabic (vso instead of svo). an interesting observation is that the vanilla bert model (which recall is only tuned on english text) generally outperforms a variety of approaches across three test languages. this is particularly remarkable because it is a single trained model that is effective across all three languages, without any difference in parameters. the exceptions are the arabic 2001 dataset, in which it performs only comparably to bm25 and the map results for spanish. for spanish, rm3 is able to substantially improve recall (as evidenced by map), and since vanilla bert acts as a re-ranker atop bm25, it is unable to take advantage of this improved recall, despite significantly improving the precision-focused metrics. in all cases, vanilla bert exhibits judged@20 above 85%, indicating that these test collections are still valuable for evaluation. to test whether a small amount of in-language training data can further improve bert ranking performance, we conduct an experiment that uses the other collection for each language as additional training data. the in-language samples are interleaved into the english training samples. results for this fewshot setting are shown in table 2 . we find that the added topics for arabic 2001 (+50) and spanish 4 (+25) significantly improve the performance. this results in a model significantly better than bm25 for arabic 2001, which suggests that there may be substantial distributional differences in the english trec 2004 training and arabic 2001 test collections. we further back this up by training an "oracle" bert model (training on the test data) for arabic 2001, which yields a model substantially better (p@20 = 0.7340, ndcg@20 = 0.8093, map = 0.4250). we introduced a zero-shot multilingual setting for evaluation of neural ranking methods. this is an important setting due to the lack of training data available in many languages. we found that contextualized languages models (namely, bert) have a big upper-hand, and are generally more suitable for cross-lingual performance than prior models (which may rely more heavily on phenomena exclusive to english). we also found that additional in-language training data may improve the performance, though not necessarily. by releasing our code and models, we hope that cross-lingual evaluation will become more commonplace. 
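as a concrete illustration of the vanilla bert re-ranker evaluated above, the sketch below scores query-document pairs with the base-multilingual-cased model and re-ranks a bm25 candidate list. it assumes the hugging face transformers api; the single-logit classification head stands in for the dense layer over the [cls] token, and the training loop, hyperparameters and names are illustrative rather than the released code.

```python
# zero-shot multilingual re-ranking sketch: train on english (query, pos, neg)
# triples, then apply the same model to arabic/mandarin/spanish pairs unchanged.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

def score(query, doc):
    enc = tokenizer(query, doc, truncation=True, max_length=512,
                    padding=True, return_tensors="pt")
    return model(**enc).logits.squeeze(-1)        # one relevance score per pair

def train_step(query, pos_doc, neg_doc):
    # pairwise softmax loss over a relevant and a non-relevant document
    scores = torch.stack([score(query, pos_doc), score(query, neg_doc)], dim=-1)
    loss = -torch.log_softmax(scores, dim=-1)[..., 0].mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def rerank(query, bm25_docs):
    # zero-shot application: re-rank bm25 candidates for a non-english query
    # with the english-trained model
    with torch.no_grad():
        scores = [score(query, d).item() for d in bm25_docs]
    return [d for _, d in sorted(zip(scores, bm25_docs), reverse=True)]
```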
probabilistic models of information retrieval based on measuring the divergence from randomness ms marco: a human generated machine reading comprehension dataset clef 2003 -overview of results cross-language information retrieval (clir) track overview learning to rank: from pairwise approach to listwise approach a survey of automatic query expansion in information retrieval trec 2014 web track overview word translation without parallel data deeper text understanding for ir with contextual neural language modeling bert: pre-training of deep bidirectional transformers for language understanding overview of the fourth text retrieval conference (trec-4) overview of the third text retrieval conference (trec-3) pacrr: a position-aware neural ir model for relevance matching parameters learned in the comparison of retrieval models using term dependencies google's multilingual neural machine translation system: enabling zero-shot translation cross-lingual transfer learning for pos tagging without cross-lingual resources adam: a method for stochastic optimization cross-lingual language model pretraining unsupervised cross-lingual information retrieval using monolingual data only cedr: contextualized embeddings for document ranking a markov random field model for term dependencies an introduction to neural information retrieval the trec 2002 arabic/english clir track neural information retrieval: at the end of the early years multilingual information retrieval: from research to practice cross-lingual learning-torank with shared representations cross-lingual transfer learning for multilingual task oriented dialog cross-lingual relevance transfer for document retrieval the sixth text retrieval conference (trec-6) overview of the trec 2005 robust retrieval track overview of the fifth text retrieval conference (trec-5) monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings end-to-end neural ad-hoc ranking with kernel pooling anserini: reproducible ranking baselines using lucene simple applications of bert for ad hoc document retrieval transfer learning for sequence tagging with hierarchical recurrent networks the digital language divide the 41st international acm sigir conference on research & development in information retrieval this work was supported in part by arcs foundation. key: cord-020914-7p37m92a authors: dumani, lorik; neumann, patrick j.; schenkel, ralf title: a framework for argument retrieval: ranking argument clusters by frequency and specificity date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_29 sha: doc_id: 20914 cord_uid: 7p37m92a computational argumentation has recently become a fast growing field of research. an argument consists of a claim, such as “we should abandon fossil fuels”, which is supported or attacked by at least one premise, for example “burning fossil fuels is one cause for global warming”. from an information retrieval perspective, an interesting task within this setting is finding the best supporting and attacking premises for a given query claim from a large corpus of arguments. since the same logical premise can be formulated differently, the system needs to avoid retrieving duplicate results and thus needs to use some form of clustering. 
in this paper we propose a principled probabilistic ranking framework for premises based on the idea of tf-idf that, given a query claim, first identifies highly similar claims in the corpus, and then clusters and ranks their premises, taking clusters of claims as well as the stances of query and premises into account. we compare our approach to a baseline system that uses bm25f which we outperform even with a primitive implementation of our framework utilising bert. computational argumentation is an emerging research area that has recently received increasing interest. it deals with representing and analysing arguments for controversial topics, which includes mining argument structures from large text corpora [8] . a widely accepted definition for an argument is that it consists of a claim or a standpoint, for instance "we should abandon fossil fuels", which is supported or attacked by at least one premise, for example "burning fossil fuels is one cause for global warming" or "poor people cannot afford alternative fuels" [21] . the claim is the central and usually also a controversial component, which should not be accepted by the reader without further support (by premises) [28] . from an information retrieval perspective, an interesting task within this setting is finding the best supporting (pro) and attacking (con) premises for a given query claim [31] . this has applications in many domains, including journalism and politics, and in general is relevant for making informed decisions. by now, existing (web) search engines like google only provide the most relevant documents to the user, but cannot structure their results in terms of claims and premises. there is a relatively large body of work on how arguments can be mined from text (see [8] for a recent survey). in this paper, we build upon established research on argument search engines and focus on effectively retrieving premises for a query claim from a large corpus of already mined arguments. here, a query can be either a controversial topic (e.g. "fossil fuels") or statement (e.g. "we should abandon fossil fuels"), and the task of the system is to retrieve a ranked list of pro and con premises for the query. since the same logical premise can be formulated semantically similar, an argument retrieval system has to avoid retrieving duplicate results and thus needs to use some form of clustering. previous approaches in this area have focused on estimating the relevance of premises in combination with the corresponding claims, using bm25f [30] for example. the novel contribution of this paper is a principled probabilistic ranking framework for premises that, given a query claim, first determines highly similar claims in the corpus, and then clusters and ranks their premises, taking clusters of claims as well as the stances of query and premises into account. the remainder of this paper is structured as follows: sect. 2 discusses related work. section 3 introduces necessary notation and sect. 4 presents our probabilistic ranking framework. section 5 describes details of the implementation of our framework in which we use bert [11] to capture the vectors of premises and applied hierarchical clustering. in sect. 6 we evaluate our approach with a large corpus [12] consisting of 63,250 claims and about 695,000 premises and compare it to a baseline system that uses bm25f. section 7 concludes the paper and discusses ideas for future work. stab et al. [27] present argumentext [4] . 
their argument retrieval system first retrieves relevant documents, then it identifies relevant arguments. we do not address the argument mining task. our work is more similar to the work of wachsmuth et al. [30] who present args [3] , one of the first prototypes of an argument search engine. args operates on arguments crawled from five debate portals (such as debate.org and idebate.org). given a user's keyword query, the system retrieves, ranks, and presents premises supporting and attacking the query, taking similarity of the query with the premise, its corresponding claim, and other contextual information into account. they apply a standard bm25f ranking model implemented on top of lucene. in our prior work [12] , we build on the work of wachsmuth et al. and systematically compared 196 methods for identification of similar claims by textual similarity, using a comparable large corpus of (claim, premise) pairs crawled from several debate portals. the results imply that matching similar claims to a query claim with divergence from randomness (dfr) [2] yields slightly better results than bm25 [24] . thus, we will make use of dfr to find the most similar claims to a query claim. the work on argument quality and ranking is also a subarea addressed in the community. habernal and gurevych address the relevance of premises [15] . they confronted users in a crowdsourced task with pairs of premises to decide which premise is more convincing. then, they used a bidirectional lstm to predict which of two given arguments is better. in a follow-up work [14] , they also investigate in the constitution of convincing arguments. wachsmuth et al. [32] consider the problem of judging the relevance of arguments. an overview of the work on computational argumentation quality in natural language, including theories and approaches is provided by them. their work can be used to determine the quality of arguments and thus also for the ranking. reimers et al. [23] deal with clustering premises. elmo [22] and bert [11] were used to classify and cluster topic-dependent arguments. they improve the baseline for both tasks but also recognise that arguments can address multiple aspects and therefore belong to multiple clusters. we build upon this work by using bert to cluster claims as well as premises. as they do, we use a hard clustering algorithm and leave soft clustering algorithms for future work since this paper intends to set up the foundation and show the potential of the framework. we assume that we work with a large corpus of argumentative text, for example collections of political speeches or forum discussions, that has already been mined and transferred into claims with the corresponding premises and stances. we consider the following problem: given a controversial claim or topic, for example "we should abandon fossil fuels", a user searches for the most important premises from the corpus supporting or attacking it. it is important to take into account that even if different claims or premises are semantically equivalent, they will usually be formulated in different ways, so we will consider clusters of claims (and clusters of premises) with the same meaning instead of isolated claims and premises. finding this clustering of premises and claims as well as choosing a good representative of each result cluster to show to the user are additional tasks of the system. we will now introduce some notations used in the remainder of the paper. let c be the set of all claims in our corpus. 
a claim cluster γ j ⊆ c is a subset of claims with the same meaning, and a claim clustering γ = {γ 1 , γ 2 , . . .} is a disjoint partitioning of c into claim clusters. the function γ : c → γ assigns to a claim c i ∈ c its corresponding cluster γ j (which exists and is unique). let p be the set of all premises in the corpus. we write p → c if p ∈ p appears as a premise for c ∈ c in the corpus, and p + → c if p supports c. similar to claim clusters, we consider premise clusters π j ⊆ p of premises with the same meaning and the corresponding premise clustering π = {π 1 , π 2 , . . .} as a disjoint partitioning of p into premise clusters. the function π : p → π assigns to a premise p i ∈ p its corresponding premise cluster π j . for a premise cluster π j , c(π j ) ⊆ c denotes the set of claims attacked or supported by premises in π j . note that two subsets c(π j ), c(π l ) with j = l may overlap for different premise clusters because the same premise or premises from the same cluster (e.g. 'it is very expensive') can support or attack very different claims (e.g. 'nuclear energy' and 'health care'). figure 1 gives an example of a corpus with similar claims and premises. a claim may come with a stance, and different claims may have different stances, even though they deal with the same topic. to see why this is important, consider the following example claims and their stances: c 1 = "we should use fossil fuels" (positive stance), c 2 = "we should abandon fossil fuels" (negative stance), c 3 = "fossil fuels" (neutral stance), and c 4 = "should fossil fuels be used?" (neutral stance). we treat claims with neutral stances as if they had a positive stance. for a query asking for "increase usage of fossil fuels", supporting premises would be premises that support c 1 , c 3 , c 4 , but also premises that attack c 2 . similarly, attacking premises would be those attacking c 1 , c 3 , c 4 or supporting c 2 . let q and c be query and claim on the same topic, then if q and c have the same stance, a premise supporting c will also support q. also, if q and c have opposite stance, a premise supporting c will attack q. we write q ↑↑ c if the stances of q and c are aligned and q ↑↓ c otherwise. we further assume that all claims within the same cluster have the same stance. given a query claim q, the goal is to find the best clusters of supporting and attacking premises π + , π − for q in the corpus. here, p (π + |q) defines the probability that a user would pick π as the supporting cluster of premises for q amongst all premise clusters in the corpus. furthermore, p (π − |q) is defined analogously for attacking clusters. to compute these probabilities, we first consider single premises and claims and extend this to clusters afterwards; we then will discuss how stances can be taken into account. we will restrict the examination to supporting premises, attacking premises are computed analogously. first we estimate the probability p (p + |q) that the user picks the supporting premise p for query claim q. we assume the following user model: to pick a supporting premise, the user initially selects a matching claim c for q amongst all claims in the corpus with probability p (c|q), and then picks a premise p with probability p (p + |c, q) amongst all supporting premises of this claim. considering that p may support multiple claims, p (p + |q) can thus be written as where c∈c p (c|q) = 1. 
since p (p + |c, q) = 0 if p is not a premise of c as the user picks only premises of c, we can restrict the summation to claims for which p appears as premise. in addition, we assume that p (p + |c, q) = p (p + |c), i.e. p is picked as support for c independently from q. to include the stances of query and claims, we must consider that an attacking premise of a claim with opposite stance to the query can also be picked as a supporting premise of the query. this results in the following updated expression: with p (p − |c) describing the probability that p is picked as an attacking premise of claim c, p (q ↑↑ c) being the probability that q and c have the same stance, and p (q ↑↓ c) being the probability that q and c have opposite stance. finally, to compute the probability of picking a premise cluster instead of a single premise, we additionally need to aggregate over all premises in the cluster; this works since premise clusters are disjoint by construction: note that if the user does not make a distinction between supporting and attacking clusters, but instead just wants good premise clusters, we can extend the experiment such that the user first throws a fair coin to decide if he will pick a supporting or attacking premise cluster. this leads to the following probability for picking premise cluster π j : we now present possible estimators for each of the probabilities used in our ranking framework. while we think that these estimators are reasonable, there are clearly many other ways for their estimation, for example taking argument quality [32] into account; this is left for future work. p (c|q) denotes the probability that c is "relevant" for query q, which can be estimated using standard text retrieval approaches; in our experiments, we will use divergence from randomness [2] . since most retrieval approaches are not probabilistic in nature, we need to recalibrate the computed scores such that their values correspond to probabilities. p (p + |c) is the probability that p is chosen amongst all supporting premises of c. here, we will not use textual similarity of p and c since good premises supporting or attacking a claim often have only small textual overlap with the claim. as an example, consider a user searching for premises supporting the claim "we should abandon fossil fuels". a good premise could be "wind and solar energy can already provide most of the needed energy", which does not overlap at all with the claim. instead, we will estimate this based on two different frequency statistics: the premise frequency pf(p + , c), which describes the frequency with which premise p is used as support for claims within c's claim cluster, i.e. with the same meaning as c, and the claim frequency cf(p + ), which is the number of claim clusters for which premise p is used as support. intuitively, we prefer premises that appear frequently within a claim cluster, and we may want to give lower weight to premises that appear within most or even all claim clusters. this is exactly the same principle used in the tf-idf term weight [25] . we therefore use the inverse claim frequency icf(p + ) in a form similar to standard idf. since the same "semantic" premise can appear in different textual formulations, we will consider its premise cluster instead of the actual premise when computing pf(p + ) and icf(p + ). 
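read as code, the estimators just described amount to a few lines. the sketch below ignores stance (as the implementation in this paper does) and anticipates the formal definitions of pf, cf and icf given next; the data structures and the cluster-level normalisation are illustrative simplifications, not the actual code of the system.

```python
import math
from collections import defaultdict

def rank_premise_clusters(p_claim_given_query, claim_cluster_of, premise_cluster_of,
                          premises_of_claim, num_claim_clusters):
    """p_claim_given_query: claim -> normalised p(c|q) for the retrieved claims
    claim_cluster_of:    claim -> claim-cluster id (the function gamma)
    premise_cluster_of:  premise -> premise-cluster id (the function pi)
    premises_of_claim:   claim -> list of premises attached to that claim"""
    # claim frequency: how many claim clusters does each premise cluster support?
    cf = defaultdict(set)
    premises_by_claim_cluster = defaultdict(list)
    for claim, premises in premises_of_claim.items():
        premises_by_claim_cluster[claim_cluster_of[claim]].extend(premises)
        for p in premises:
            cf[premise_cluster_of[p]].add(claim_cluster_of[claim])

    def icf(pc):
        return math.log(num_claim_clusters / len(cf[pc]))

    def pf(pc, claim):
        # frequency of the premise cluster within the claim's claim cluster
        members = premises_by_claim_cluster[claim_cluster_of[claim]]
        return sum(1 for p in members if premise_cluster_of[p] == pc)

    scores = defaultdict(float)
    for claim, p_c in p_claim_given_query.items():
        candidates = {premise_cluster_of[p] for p in premises_of_claim.get(claim, [])}
        weights = {pc: pf(pc, claim) * icf(pc) for pc in candidates}
        z = sum(weights.values()) or 1.0          # normalisation term z
        for pc, w in weights.items():
            scores[pc] += p_c * w / z             # sum over claims of p(c|q) * p(cluster|c)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```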
we can formalise this as follows: pf(p + , c) = |{(p', c') : p' ∈ π(p), c' ∈ γ(c), p' + → c'}| counts how often a premise from p's premise cluster supports a claim from c's claim cluster, cf(p + ) = |{γ j ∈ γ : ∃p' ∈ π(p), ∃c' ∈ γ j such that p' + → c'}| counts the claim clusters supported by premises from p's premise cluster, and icf(p + ) = log(|γ| / cf(p + )). we then estimate p (p + |c) as p (p + |c) = pf(p + , c) · icf(p + ) / z, where z is a normalisation term computed as the sum of the unnormalised pf·icf products over all candidate premises; this is not needed for ranking the premises. estimating the probability that two claims (or, more generally, two statements) have the same stance is a surprisingly hard problem that has not yet been solved, especially if two statements have different stances [16]. we therefore omit this part of the framework in this paper and instead focus on the evaluation of the other parts, which form the core of the framework. now we describe the concrete implementation of the framework, i.e. the clustering of claims as well as the clustering of premises. clustering the claims. we cluster the claims in an offline operation with hierarchical clustering. for each claim, we calculate its embeddings using bert [11]. this allowed us to create an agglomerative clustering [17], i.e. a bottom-up approach. compared to k-means [20], hierarchical clustering has the advantage of not needing to provide the number of resulting clusters beforehand. in general, only a few parameters are expected here, which leads to less overfitting. for example, it expects only a method to determine the distance between two vectors and a method to link clusters. for the former we have taken the often used euclidean distance function, and for the latter the widely used average linkage method [26], which connects two clusters based on the mean distance between them. in order to determine a cutoff value for the clustering, we took the implementation of langfelder et al. [19], which produces a dynamic tree cut. in contrast to a constant-height cut, it is, amongst other things, capable of identifying nested clusters. clustering the premises and computing results. since there are usually many more premises than claims, precomputing their clustering is not viable. instead, we use an approximation that clusters relevant premises at query time. after a query claim q arrives in the system, the top k most similar claims r = {r i | 1 ≤ i ≤ k} are retrieved from the corpus using divergence from randomness [2]. at the same time, we obtain p (c|q) (after normalisation). then the corresponding claim clusters are determined and all their premises m = {p | ∃c ∈ r, ∃c' ∈ γ(c) such that p → c'} are retrieved from the corpus. from the set m, an expanded set m' is then constructed by adding, for each premise in m, its n most similar premises from the corpus, according to the state-of-the-art standard retrieval method bm25. this ensures that our premise set is large enough to compute claim frequencies. using bert embeddings again, this expanded premise set is first hierarchically clustered and then a dynamic tree cut is made. unfortunately, bert does not support more than 512 tokens, but some premises are longer. we have thus implemented the three variants bert 512 , bert sw , and bert sent . with bert 512 we simply truncate a premise after 512 tokens, i.e. the embeddings only refer to the first 512 tokens of a text. with bert sw we utilised a sliding window, i.e. for premises with more than 512 tokens we always considered only text spans with a maximum length of 512, but always shifted the window to the right by 256 until the end of the premise in order to keep as much context information as possible. hence, for a text s that has more than 512 tokens, we get roughly |s|/256 embeddings, of which the pointwise average is calculated at the end. 
with bert sent we determine embeddings for each sentence of a premise and finally form the average of all embeddings for a premise pointwise. after the clustering, premise frequency and claim frequency are computed for each premise in the original set m as well as the final probabilities for each premise cluster. lastly, the clusters have to be presented to the user in an adequate format. therefore, a premise is chosen from each cluster as a representative. in our implementation, this is the premise p with the longest text. now we describe the evaluation of our approach which clusters and ranks premises with respect to given queries. first we explain the dataset and the baseline we used, then we describe the setup of the ground-truth of premise clusters and the evaluation metrics. finally, we present the evaluation results. we used the dataset of our prior work [12] which consists of 63,250 claims and about 695,000 premises extracted from four debate portals. after clustering, the 63,250 claims were distributed over a total of 10,611 clusters. the average cluster size is about 6.1, the median is 5. the final evaluation corpus in this prior work consists of triples of the form (query claim, result claim, result premise) for a total of 232 query claims which are all related to the topic "energy". result claims are these which were identified by pooling the top five similar claims for a query claim using standard ir methods. the result premises are associated with the corresponding result claims. using this final evaluation corpus, we randomly selected 30 query claims and extracted 1,221 individual triples. as the premises later had to be clustered manually, we made sure that the union of the result premises of all result claims for each of the 30 query claims did not exceed the number 50. the relevance of each premise for the corresponding query claim was assessed by two annotators on a three-fold relevance scale as "very relevant", "relevant", and "not relevant". note that the actual result claims were not shown to the assessors. the inter-annotator agreement, measured with krippendorff's α [18] , was 0.480 on a nominal scale and 0.689 on an interval scale, indicating that the annotation is robust. disagreements between the annotations were discussed in order to achieve an agreement. after removing 26 triples because their premises were annotated as "spam" or "other", we obtained a final corpus corp eval of 1,195 triples consisting of 389 very relevant, 139 relevant, and 667 not relevant premises for the 30 queries. as a baseline system, we implemented the approach proposed by wachsmuth et al. [30] that indexes premises together with their claims and uses a bm25f scoring model [24] , giving more importance to the claim than to the premise 4 . since they gave no parameter settings, we use the default values 1.2 and 0.75 for k 1 and b, respectively [7] . as wachsmuth et al. describe, the three fields conclusion, full arguments, and discussion were added to the bm25f method. in the field 'conclusion' we store the result claim, in the field 'full argument' the premise together with the result claim. the field 'discussion' reflects the context and contains the whole debate, i.e. the result claim and all its premises. in order to setup a ground-truth for our experiments, we derive a ground-truth corpus corp gt by including only the 528 triples from corp eval where the premises were assessed either as relevant or as very relevant to the query claim. 
for each of the 30 query claims, the premises of corp gt were clustered by two annotators. they were shown all result premises for a query claim, then they clustered them based on their subjectively perceived semantic similarity. one annotated, the other checked. again, discordances were discussed in order to achieve an agreement. please note that the annotators were instructed to assign only premises with the same relevance level to the same (ground-truth) cluster, which also served as a pre-filter to reduce complexity. since we are searching for similar claims to a query claim in the first step, it is essential to know their stances in order to identify the stances of the premises to a query, so that the clustering of the premises can be divided into pros and cons. however, as it is (still) an unsolved problem to match the stance with a good probability [16], we will ignore the stance in this experiment and tackle this task in future work. for each query, the ground-truth g then consists of clusters g 1 , . . . , g t such that each g i contains premises with the same meaning and with the same relevance level assigned by the assessors. the relevance level assigned to premises in cluster g i is denoted by rel(g i ). we assume that the clusters are numbered such that i ≤ j implies rel(g i ) ≥ rel(g j ). note that premises assessed as irrelevant are not included in any ground-truth cluster. the user now asks for a summary of premises supporting and attacking the query claim. a good system will now retrieve, for a given query, a list of premises that (1) covers the different premise clusters in the ground-truth, (2) retrieves premises from highly relevant clusters before premises from "only" relevant clusters, and (3) does not retrieve multiple premises from the same cluster. note that this setup is different from standard ad-hoc retrieval since the system must identify the various aspects of the results. it also differs from diversity-aware and novelty-aware approaches [9] since the user is interested in all aspects of the query, but asks for a single representative result per aspect only. to evaluate the quality of the retrieved results, we use a simplified variant of α-ndcg [10], which we will later extend to work with clusters as results. we consider two sub-tasks here. in task a, the system retrieves a list of premise clusters, whereas in task b, the system needs to additionally decide for one representative premise from each cluster to show to the user. a system that would not at all consider premise clusters, for example by indexing and searching directly at the level of premises, can solve task b only. we will now first explain how to evaluate task b with a simplified variant of α-ndcg [10] where we set α = 1.0 and consider each ground-truth cluster as an information nugget. the system returns a sorted list of premises r = (r 1 , r 2 , ..., r k ) where r 1 is the topmost result; we assume that there are no ties in the ranking (otherwise, ties will be broken arbitrarily). to compute the gain of the result at rank i, we first check if it appears in any ground-truth cluster; if not, its gain is 0. otherwise, let g j be the ground-truth cluster of r i . if no result of this cluster has appeared up to rank i − 1, the gain of r i is rel(g j ); otherwise, its gain is 0 since it does not contribute a novel aspect. as in standard ndcg, the discount for rank i is computed as 1/log 2 (i) if i ≥ 2 and 1 otherwise. 
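a compact sketch of this task b metric follows; the ideal ranking it assumes lists one premise per ground-truth cluster in descending relevance order, exactly as described next, and the call at the end reproduces the worked example below. python and the data structures are our own illustrative choices, not the evaluation code used in the experiments.

```python
import math

def simplified_alpha_ndcg(result_premises, ground_truth_clusters, relevance, k=None):
    """ground_truth_clusters: premise -> cluster id (irrelevant premises absent)
    relevance: cluster id -> graded relevance level (e.g. 2 or 1)"""
    def discount(rank):                       # rank is 1-based
        return 1.0 if rank == 1 else 1.0 / math.log2(rank)

    k = k or len(result_premises)
    seen = set()
    dcg = 0.0
    for rank, premise in enumerate(result_premises[:k], start=1):
        cluster = ground_truth_clusters.get(premise)
        if cluster is not None and cluster not in seen:   # first hit of a novel cluster
            dcg += relevance[cluster] * discount(rank)
            seen.add(cluster)

    # ideal ranking: one premise per cluster, clusters by descending relevance
    ideal_gains = sorted(relevance.values(), reverse=True)
    idcg = sum(g * discount(r) for r, g in enumerate(ideal_gains[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

# the worked example below: clusters g1 (level 2), g2 and g3 (level 1); the system
# returns hits for g1 and g2, six misses or duplicates, and finally a hit for g3
hits = ["p_g1", "p_g2", "x", "x", "x", "x", "x", "p_g3"]
clusters = {"p_g1": "g1", "p_g2": "g2", "p_g3": "g3"}
print(round(simplified_alpha_ndcg(hits, clusters, {"g1": 2, "g2": 1, "g3": 1}), 2))  # ~0.92
```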
in the ideal gain vector needed for computing ndcg, the component at position i is the relevance level rel(g i ) of g i , which is ideal since ground-truth clusters are ordered by descending relevance level. to illustrate the principles of our metrics for task b, consider the ground-truth shown in fig. 2. the left visualises the ground-truth for a query with three clusters: g 1 which is highly relevant (score 2), and g 2 and g 3 which are relevant (score 1). on the right are the premises that the system has returned, sorted by their estimated relevance. the ideal gain vector for this ground-truth is (2, 1, 1), corresponding to an ideal discounted cumulative gain of 2 + 1 + 1/log 2 (3) ≈ 3.63. the gains for the result list retrieved by the system are 2, 1, 0, 0, 0, 0, 0, 1 (since duplicate results from the same cluster are assigned a gain of 0), corresponding to a discounted cumulative gain of 2 + 1 + 1/log 2 (8) = 10/3. the ndcg of this result list is thus approximately (10/3)/3.63 ≈ 0.92. task a is more difficult to evaluate since we do not have a list of premises, but a list of premise clusters (i.e. sets of premises); existing ndcg variants cannot be applied here since they operate on lists of documents, not clusters. to be able to apply the evaluation machinery introduced for task b, we generate all possible result lists from the list of clusters, compute ndcg for each list, and aggregate the per-list values using either average, max, or min. if, e.g., our system returns two clusters π 1 = {p 1 , p 2 } and π 2 = {p 3 , p 4 }, then the result lists (p 1 , p 3 ), (p 1 , p 4 ), (p 2 , p 3 ), (p 2 , p 4 ) are generated. the results of our evaluation can be found in tables 1 and 2. table 1 shows the evaluation of task b, i.e. the mean ndcg@{5,10} values for all queries. since this process requires the selection of a representative and is difficult to decide even for humans, we have simply taken the longest premise. the table reveals that the implementation bert sw , which calculates the premises' embeddings using the 'sliding window' method, performs best. for bert sw and bert 512 , the observed improvements over the baseline bm25f are statistically significant for ndcg@5 (tested with welch's t-test [33] with p = 0.05). since the baseline only returns a ranked list and not a ranked list of clusters, we interpret this list as clusters each with one entry in table 2. we can infer from table 2 that bert sw performs best. using welch's t-test with p = 0.05 once more, the observed improvement over the baseline is statistically significant for the mean average ndcg@5 but not for ndcg@10. still, the results imply that bert sw is at least as good as the baseline for ndcg@10. note that bert sw has not even been fine-tuned. moreover, the results in table 2 unambiguously underline the importance of clustering and even more the choice of the correct representative: if we always chose the best representative, we would always obtain the maximum value, and vice versa. note that the premises used in our experiment are extracted from debate portals and thus are not always premises in the sense of argumentation theory, as they often consider more than one aspect. clustering and ranking premises is a very difficult, but important task, since a user searching for premises wants them to be presented in a compact and complete format. in this paper, we made use of the idea of tf-idf and presented a framework for clustering and ranking premises. 
we used premises from debate portals, which are partially from moderated websites, and of high quality but usually very long. we showed that ranking premises by their frequency and specificity has great potential since our implementation using bert and a hard clustering algorithm outperforms the baseline bm25f although the model was not fine-tuned and the premises actually cover many aspects, so a premise could be assigned to several clusters. in future work we will integrate soft clustering algorithms, for which we first have to break down the premises into their individual parts (e.g. argumentative discourse units and elementary discourse units) [29] . in addition, we will train different fine-tunings for different sentence embedding models in order to achieve better results. in our implementation, the clustering of the 695,000 premises was not precalculated, instead it was determined dynamically for a smaller subset, since this is a very computationally intensive task. therefore, we will also precalculate the clusters of premises. to stay within the scope of this paper, we have assumed a flat hierarchy for argument graphs, where an argument consists of a claim and many premises, as they occur e.g. in debate portals. in the future we will extend our framework with more complex structures with more layers. contextual string embeddings for sequence labeling probabilistic models of information retrieval based on measuring the divergence from randomness open source connections. bm25f in lucene standard values for k1 and b for bm25 five years of argument mining: a data-driven analysis a comparative analysis of cascade measures for novelty and diversity novelty and diversity in information retrieval evaluation bert: pre-training of deep bidirectional transformers for language understanding a systematic comparison of methods for finding good premises for claims are you convinced? choosing the more convincing evidence with a siamese network what makes a convincing argument? empirical analysis and detecting attributes of convincingness in web argumentation which argument is more convincing? analyzing and predicting convincingness of web arguments using bidirectional lstm a retrospective analysis of the fake news challenge stancedetection task algorithms for clustering data estimating the reliability, systematic error and random error of interval data dynamic tree cut: in-depth description, tests and applications least squares quantization in pcm from argument diagrams to argumentation mining in texts: a survey proceedings of the 2018 conference of the north american chapter of the association for computational linguistics: human language technologies, naacl-hlt 2018 classification and clustering of arguments with contextualized word embeddings the probabilistic relevance framework: bm25 and beyond a vector space model for automatic indexing a statistical method for evaluating systematic relationships argumentext: searching for arguments in heterogeneous sources identifying argumentative discourse structures in persuasive essays parallel discourse annotations on a corpus of short texts building an argument search engine for the web pagerank" for argument relevance computational argumentation quality assessment in natural language the generalization of 'student's' problem when several different population variances are involved we would like to thank manuel biertz, christin katharina kreutz, alex witry, and tobias zeimetz for their invaluable help in the annotations. 
key: cord-020806-lof49r72 authors: landin, alfonso; parapar, javier; barreiro, álvaro title: novel and diverse recommendations by leveraging linear models with user and item embeddings date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_27 sha: doc_id: 20806 cord_uid: lof49r72 nowadays, item recommendation is an increasing concern for many companies. users tend to be more reactive than proactive for solving information needs. recommendation accuracy became the most studied aspect of the quality of the suggestions. however, novel and diverse suggestions also contribute to user satisfaction. unfortunately, it is common to harm those two aspects when optimizing recommendation accuracy. in this paper, we present eer, a linear model for the top-n recommendation task, which takes advantage of user and item embeddings for improving novelty and diversity without harming accuracy. in recent years, the way users access services has shifted from a proactive approach, where the user actively looks for the information, to one where the users take a more passive role, and content is suggested to them. within this transformation, recommender systems have played a pivotal role, enabling an increase in user engagement and revenue. recommender systems are usually classified into three families [1] . the first approach, content-based systems, use item metadata to produce recommendations [7] . the second family, collaborative filtering, is composed of systems that exploit the past interactions of the users with the items to compute the recommendations [10, 17] . these interactions can take several forms, such as ratings, clicks, purchases. finally, hybrid approaches combine both to generate suggestions. collaborative filtering (cf) systems can be divided into memory-based systems, that use the information about these interactions directly to compute the recommendations, and model-based systems, that build models from this information that are later used to make the recommendations. in this paper, we will present a cf model to address the top-n recommendation task [4] . the objective of a top-n recommender is to produce a ranked list of items for each user. these systems can be evaluated using traditional ir metrics over the rankings [2, 4] . in that evaluation approach, accuracy is usually the most important metric and has been the focus of previous research and competitions [3] . nevertheless, other properties are also important, such as diversity and novelty [8, 13] . diversity is the ability of the system to make recommendations that include items equitably from the whole catalog, which is usually desired by vendors [5, 22] . on the other hand, novelty is the capacity of the system to produce unexpected recommendations. this characteristic is a proxy for serendipity, associated with higher user engagement and satisfaction [6] . all these properties, accuracy, diversity and novelty, are linked to the extent that raising accuracy usually lowers the best achievable results in the other properties [11] . in this paper, we propose a method to augment an existing recommendation linear model to make more diverse and novel recommendations, while maintaining similar accuracy results. we do so by making use of user and item embeddings that are able to capture non-linear relations thanks to the way they are obtained [21] . 
experiments conducted on three datasets show that our proposal outperforms the original model in both novelty and diversity while maintaining similar levels of accuracy. with reproducibility in mind, we also make the software used for the experiments publicly available 1 . in this section, we introduce fism, the existing recommendation method we augment in our proposal. after that, we introduce prefs2vec, the user and item embedding model used to make this enhancement. fism is a state-of-the-art model-based recommender system proposed by kabbur et al [9] . this method learns a low rank factorization of an item-item similarity matrix, which is later used to compute the scores to make the predictions. this method is an evolution of a previous method, slim [16] , that learns this matrix without factorizing it. factorizing the similarity matrix allows fism to overcome slim's limitation of not being able to learn a similarity other than zero for items that have never been rated both by at least one user. as a side effect of this factorization, it lowers the space complexity from o( |i|. it also drops the non-negativity constraint and the constraint that the diagonal of the similarity matrix has to contain zeroes. as a consequence of these changes, the optimization problem can be solved using regular gradient descent algorithms, instead of the coordinated gradient descent used by slim, leading to faster training times. embedding models allow transforming high-dimensional and sparse vector representations, such as classical one-hot and bag-of-words, into a space with much lower dimensionality. in particular, previous word embedding models, that produce fixed-length dense representations, have proven to be more effective in several npl tasks [14, 15, 19] . recently, prefs2vec [21] , a new embedding model for obtaining dense user and item representations, an adaptation of the cbow model [14] , has shown that these embeddings can be useful for the top-n recommendation task. when used with a memory-based recommender, they are more efficient than the classical representation [21] . the results show that not only they can improve the accuracy of the results, but also their novelty and diversity. the versatility of this embedding model, in particular of the underlying neural model and the way it is trained, is also shown in [12] . here the prediction capabilities of the neural model are used directly in a probabilistic recommender. in this section, we present our method to enhance diversity and novelty in recommendation, explaining how the model is trained and used to produce recommendations. firstly, we introduce how the product of user and item embeddings (based on prefs2vec) can be used to make recommendations, which is later used as part of the proposal. as representations of users and items in a space with much lower dimensionality, prefs2vec embeddings can be viewed as latent vectors. however, there is no sense in multiplying both item and user vectors as they have different basis even when they have the same dimensions. this is a consequence of learning the item and user representations independently, how prefs2vec initializes the parameters of the model and how the training is performed. however, it is possible to make this product if we can compute a change of basis matrix t ∈ r d×d to transform the user embeddings into the item embeddings space. 
this way we can calculate an estimated ratings matrixr using the simple matrix multiplication: where e ∈ r |u |×d is the matrix of user embeddings, and f ∈ r |i|×d is the matrix of item embeddings, one embedding in each row. the transformation matrix t is learned by solving the optimization problem with 2 regularization: where r is the ratings matrix and β e is the regularization hyperparameter. this problem can be solved using gradient descent algorithms. once the transformation matrix has been trained, recommendations can be produced by computing the estimated rating matrixr as described in eq. 1. recommendations are made to each user by sorting the corresponding row and picking the top-n items not already rated by the user. we dubbed this recommender elp, short for embedding linear product, and we present its performance in table 3 in the experiments section. we have seen that linear methods, like fism, can obtain good accuracy figures. on the other side, as results in table 3 show, elp is able to provide good figures in novelty and diversity, thanks to the embedding model capturing non-linear relations between users and items. we propose to capture both properties by joining the models together in the eer model (embedding enhanced recommender). we choose the rmse variant of fism as it matches the loss used in elp. we also use a trainable scalar parameter α to joint the models, as the scores obtained from each recommender need not be on the same scale. this results in the following equation to calculate the estimated ratings matrix:r where p ∈ r |i|×k and q ∈ r k×|i| are the low rank factorization of the item-item similarity matrix. the parameters of the model, p , q, t and α, are learned by solving the joint 2 regularized optimization problem resulting from the previous joint equation, using standard gradient descent algorithms: minimize p ,q ,t ,α similar to the case of elp, once the parameters are learned, we make the recommendations by calculating the estimated ratings matrix using eq. 3, sorting each row and picking the top-n items not yet rated by the user corresponding to that row. in this section, we introduce the datasets used to perform our experiments, the evaluation protocol followed and the metrics used. after that, we present the results of our experiments. to evaluate our proposal, we conducted a series of experiments on several datasets, from different domains: the movielens 20m dataset 2 , a movie dataset, the book dataset librarything, and the beeradvocate dataset 3 , consisting of beer reviews. table 1 shows statistics of each collection. in order to perform the experiments, the datasets were divided randomly into train and test sets. the training dataset consisted of 80% of the ratings of each user, with the remaining 20% forming the test dataset. we follow the testitems evaluation methodology [2] to evaluate the performance. to assess the accuracy of the rankings, we use normalized discounted cumulative gain (ndcg), using the standard formulation as described in [23] , with the ratings in the test set as graded relevance judgments. we considered only items with a rating of 4 or more, on a 5 point scale, to be relevant for evaluation purposes. we also measured the diversity of the recommendations using the complement of the gini index [5] . finally, we use the mean self-information (msi) [24] to assess the novelty of the recommendations. 
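for reference, the diversity and novelty metrics just introduced can be computed along the following lines; this is a minimal sketch of one common formulation (item popularity estimated from the training interactions), with names of our own choosing, not the exact evaluation code used here.

```python
import math
from collections import Counter

def gini_complement(recommendation_lists, catalog_size):
    """1 - gini index of how evenly the catalog is covered by all top-n lists."""
    counts = Counter(item for rec in recommendation_lists for item in rec)
    freqs = sorted(counts.get(i, 0) for i in range(catalog_size))
    total = sum(freqs) or 1
    gini = sum((2 * (idx + 1) - catalog_size - 1) * f
               for idx, f in enumerate(freqs)) / (catalog_size * total)
    return 1.0 - gini

def mean_self_information(recommendation_lists, train_user_items):
    """average over users of the summed self-information -log2(popularity) of the
    recommended items; popularity is the fraction of training users who rated the item."""
    num_users = len(train_user_items)
    popularity = Counter(item for items in train_user_items for item in set(items))
    def self_information(item):
        p = popularity.get(item, 0) / num_users
        return -math.log2(p) if p > 0 else 0.0
    return sum(sum(self_information(i) for i in rec)
               for rec in recommendation_lists) / len(recommendation_lists)
```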
all the metrics are evaluated at cut-off 100 because it has been shown to be more robust with respect to the sparsity and popularity biases than shallower cut-offs [20]. we perform a wilcoxon test [18] to assess the statistical significance of the improvements regarding ndcg@100 and msi@100, with p < 0.01. we cannot apply it to the gini index because we are using a paired test and gini is a global metric. results in table 3 are annotated with their statistical significance. we performed a grid search over the hyperparameters of the original model and our proposal, tuning them to maximize ndcg@100. although we aim to increase diversity and novelty, we want the recommendations to be effective, which is why the tuning is done over accuracy. for the parameters of the prefs2vec model, we took those that performed best in [21]. for reproducibility's sake, the values of the best hyperparameters for each collection can be consulted in table 2.
table 2. best values of the hyperparameters for ndcg@100 for fism and our proposals eer and elp.
        movielens 20m               librarything                 beeradvocate
fism    β = 1, k = 1000             β = 1000, k = 1000           β = 50, k = 1000
elp     βe = 0.1                    βe = 10                      βe = 10
eer     β = 0.1, βe = 1, k = 1000   β = 500, βe = 10, k = 1000   β = 10, βe = 1, k = 1000
table 3 shows the values of ndcg@100, gini@100 and msi@100 for fism, eer and elp. the results show that eer outperforms the baseline (fism) on both novelty and diversity. it also surpasses it on accuracy on the movielens 20m and librarything datasets. in the case of diversity, we can see important improvements. elp, on the other hand, obtains the best diversity and novelty values, but this comes with a big reduction in accuracy. it is common in the field of recommender systems for methods with lower accuracy to have higher values in diversity and novelty. we believe that the ability of the embeddings to find non-linear relationships contributes to the model's novelty and diversity. this property of the model allows it, for example, to discover relationships between popular and not so popular items, leading to better diversity. moreover, the integration in the linear model allows the method to keep its advantage in terms of accuracy, clearly surpassing the use of embeddings in isolation (elp). in this paper, we presented eer, a method to enhance an existing recommendation algorithm to produce recommendations that are both more diverse and novel, while maintaining similar levels of accuracy. this is done by combining two models: a linear one that is able to obtain good levels of accuracy, and a model based on an embedding technique that extracts non-linear relationships, allowing it to produce more diverse and novel recommendations. as future work, we plan to apply the same technique to other recommender systems, examining whether it can be applied in general to enhance the recommendations, independently of the base algorithm chosen for the task. we also envision studying the effects that varying the value of α in eq. 3 has on the recommendations.
fab: content-based, collaborative recommendation precision-oriented evaluation of recommender systems the netflix prize performance of recommender algorithms on top-n recommendation tasks blockbuster culture's next rise or fall: the impact of recommender systems on sales diversity beyond accuracy: evaluating recommender systems by coverage and serendipity semantics-aware content-based recommender systems recommender systems handbook evaluating collaborative filtering recommender systems fism: factored item similarity models for top-n recommender systems advances in collaborative filtering when diversity met accuracy: a story of recommender systems prin: a probabilistic recommender with item priors and neural models being accurate is not enough: how accuracy metrics have hurt recommender systems efficient estimation of word representations in vector space distributed representations of words and phrases and their compositionality slim: sparse linear methods for top-n recommender systems a comprehensive survey of neighborhoodbased recommendation methods using score distributions to compare statistical significance tests for information retrieval evaluation glove: global vectors for word representation on the robustness and discriminative power of information retrieval metrics for top-n recommendation collaborative filtering embeddings for memory-based recommender systems item-based relevance modelling of recommendations for getting rid of long tail products. knowl.-based syst a theoretical analysis of ndcg ranking measures solving the apparent diversity-accuracy dilemma of recommender systems key: cord-020846-mfh1ope6 authors: zlabinger, markus; hofstätter, sebastian; rekabsaz, navid; hanbury, allan title: dsr: a collection for the evaluation of graded disease-symptom relations date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_54 sha: doc_id: 20846 cord_uid: mfh1ope6 the effective extraction of ranked disease-symptom relationships is a critical component in various medical tasks, including computer-assisted medical diagnosis or the discovery of unexpected associations between diseases. while existing disease-symptom relationship extraction methods are used as the foundation in the various medical tasks, no collection is available to systematically evaluate the performance of such methods. in this paper, we introduce the disease-symptom relation collection (dsr-collection), created by five physicians as expert annotators. we provide graded symptom judgments for diseases by differentiating between relevant symptoms and primary symptoms. further, we provide several strong baselines, based on the methods used in previous studies. the first method is based on word embeddings, and the second on co-occurrences of mesh-keywords of medical articles. for the co-occurrence method, we propose an adaption in which not only keywords are considered, but also the full text of medical articles. the evaluation on the dsr-collection shows the effectiveness of the proposed adaption in terms of ndcg, precision, and recall. disease-symptom knowledge bases are the foundation for many medical tasks -including medical diagnosis [9] or the discovery of unexpected associations between diseases [12, 14] . most knowledge bases only capture a binary relationship between diseases and symptoms, neglecting the degree of the importance between a symptoms and a disease. 
for example, abdominal pain and nausea are both symptoms of an appendicitis, but while abdominal pain is a key differentiating factor, nausea does only little to distinguish appendicitis from other diseases of the digestive system. while several disease-symptom extraction methods have been proposed that retrieve a ranked list of symptoms for a disease [7, 10, 13, 14] , no collection is available to systematically evaluate the performance of such methods [11] . while these method are extensively used in downstream tasks, e.g., to increase the accuracy of computer-assisted medical diagnosis [9] , their effectiveness for disease-symptom extraction remains unclear. in this paper, we introduce the disease-s ymptom relation collection (dsrcollection) for the evaluation of graded disease-symptom relations. the collection is annotated by five physicians and contains 235 symptoms for 20 diseases. we label the symptoms using graded judgments [5] , where we differentiate between: relevant symptoms (graded as 1) and primary symptoms (graded as 2). primary symptoms-also called cardinal symptoms-are the leading symptoms that guide physicians in the process of disease diagnosis. the graded judgments allow us for the first time to measure the importance of different symptoms with grade-based metrics, such as ndcg [4] . as baselines, we implement two methods from previous studies to compute graded disease-symptom relations: in the first method [10] , the relation is the cosine similarity between the word vectors of a disease and a symptom, taken from a word embedding model. in the second method [14] , the relation between a disease and symptom is calculated based on their co-occurrence in the meshkeywords 1 of medical articles. we describe limitations of the keyword-based method [14] and propose an adaption in which we calculate the relations not only on keywords of medical articles, but also on the full text and the title. we evaluate the baselines on the dsr-collection to compare their effectiveness in the extraction of graded disease-symptom relations. as evaluation metrics, we consider precision, recall, and ndcg. for all three metrics, our proposed adapted version of the keyword-based method outperforms the other methods, providing a strong baseline for the dsr-collection. the contributions of this paper are the following: -we introduce the dsr-collection for the evaluation of graded disease-symptom relations. we make the collection freely available to the research community. 2 -we compare various baselines on the dsr-collection to give insights on their effectiveness in the extraction of disease-symptom relations. in this section, we describe the new disease-s ymptom relation collection (dsr-collection) for the evaluation of disease-symptom relations. we create the collection in two steps: in the first step, relevant disease-symptom pairs (e.g. appendicitis-nausea) are collected by two physicians. they collect the pairs in a collaborative effort from high-quality sources, including medical textbooks and an online information service 3 that is curated by medical experts. in the second step, the primary symptoms of the collected disease-symptom pairs are annotated. the annotation of primary symptoms is conducted to incorporate a graded relevance information into the collection. for the annotation procedure, we develop guidelines that briefly describe the task and an online annotation tool. then, the annotation of primary symptoms is conducted by three physicians. 
the final label is obtained by a majority voting. based on the labels obtained from the majority voting, we assign the relevance score 2 to primary symptoms and 1 to the other symptoms, which we call relevant symptoms. in total, the dsr-collection contains relevant symptoms and primary symptoms for 20 diseases. we give an overview of the collection in table 1 . for the 20 diseases, the collection contains a total of 235 symptoms, of which 55 are labeled as primary symptom (about 25%). the top-3 most occurring symptoms are: fatigue which appears for 15 of the 20 diseases, fever which appears for 10, and coughing which appears for 7. notice that the diseases are selected from different medical disciplines: mental (e.g. depression), dental (e.g. periodontitis), digestive (e.g. appendicitis), and respiration (e.g. asthma). we calculate the inter-annotator agreement using fleiss' kappa [2] , a statistical measure to compute the agreement for three or more annotators. for the annotation of the primary symptoms, we measure a kappa value of κ = 0.61, which indicates a substantial agreement between the three annotators [6] . individual κ-values per disease are reported in table 1 . by analyzing the disagreements, we found that the annotators labeled primary symptoms with varying frequencies: the first annotator annotated on average 2.1 primary symptoms per disease, the second 2.8, and the third 3.8. vocabulary compatibility: we map each disease and symptom of the collection to the unified medical language system (umls) vocabulary. the umls is a compendium of over 100 vocabularies (e.g. icd-10, mesh, snomed-ct) that are cross-linked with each other. this makes the collection compatible with the umls vocabulary and also with each of the over 100 cross-linked vocabularies. although the different vocabularies are compatible with the collection, a fair comparison of methods is only possible when the methods utilize the same vocabulary since the vocabulary impacts the evaluation outcome. for instance, the symptom loss of appetite is categorized as a symptom in mesh; whereas, in the cross-linked umls vocabulary, it is categorized as a disease. therefore, the symptom loss of appetite can be identified when using the mesh vocabulary, but it cannot be identified when using the umls vocabulary. evaluation: we consider following evaluation metrics for the collection: recall@k, precision@k, and ndcg@k at the cutoff k = 5 and k = 10. recall measures how many of the relevant symptoms are retrieved, precision measures how many of the retrieved symptoms are relevant, and finally, ndcg is a standard metric to evaluate graded relevance [5] . in this section, we discuss disease-symptom extraction methods used in previous studies. a commonly used resource for the extraction of disease-symptom relations are the articles of the pubmed database. pubmed contains more than 30 million biomedical articles, including the abstract, title, and various metadata. previous work [3, 7] uses the abstracts of the pubmed articles together with rule-based approaches. in particular, hassan et al. [3] derive patterns of disease-symptom relations from dependency graphs, followed by the automatic selection of the best patterns based on proposed selection criteria. martin et al. [7] generate extraction rules automatically, which are then inspected for their viability by medical experts. xia et al. [13] design special queries that include the name and synonyms of each disease and symptom. 
they use these queries to return the relevant articles, and use the number of retrieved results to perform a ranking via pointwise mutual information (pmi). the mentioned studies use resources that are not publicly available, i.e., rules in [3, 7] and special queries in [13] . to enable reproducibility in future studies, we define our baselines based on the methods that only utilize publicly available resources, described in the next section. here, we first describe two recently proposed methods [10, 14] for the extraction of disease-symptom relations as our baselines. afterwards, we describe limitations of the method described in [14] and propose an adapted version in which the limitations are addressed. we apply the methods on the open-access subset of the pubmed central (pmc) database, containing 1,542,847 medical articles. to have a common representation for diseases/symptoms across methods (including an unique name and identifier), we consider the 382 symptoms and 4,787 diseases from the medical subject headings (mesh) vocabulary [14] . given the set of diseases (x) and symptoms (s), each method aims to compute a relation scoring function λ(x, s) ∈ r between a disease x ∈ x and a symptom s ∈ s. in the following, we explain each method in detail. embedding: proposed by shah et al. [10] , the method is based on the cosine similarity of the vector representations of a disease and a symptom. we first apply metamap [1] , a tool for the identification of medical concepts within a given text, to the full text of all pmc articles to substitute the identified diseases/symptoms by their unique names. then, we train a word2vec model [8] with 300 dimensions and a window size of 15, following the parameter setting in [10] . using the word embedding, the disease-symptom relation is defined as λ(x, s) = cos(e x , e s ), where e refers to the vector representation of a word. cooccur: this method, proposed by zhou et al. [14] , calculates the relation of a disease and a symptom, by measuring the degree of their co-occurrences in the mesh-keywords of medical articles. the raw co-occurrence of the disease x and symptom s, is denoted by co(x, s). the raw co-occurrence does not consider the overall appearance of each symptom across diseases. for instance, symptoms like pain or obesity tend to co-occur with many diseases, and are therefore less informative. hence, the raw co-occurrence is normalized by an inverse symptom frequency (isf) measure, defined as isf(s) = |x| ns , where |x| is the total number of diseases and n s is the number of diseases that co-occur with s at least in one of the articles. finally, the disease-symptom relation is defined as λ(x, s) = co(x, s) × isf(s). we compute three variants of the cooccur method: -kwd: the disease-symptom relations are computed using the mesh-keywords of the ≈1.5 million pmc articles. -kwdlarge: while kwd uses the 1.5 million pmc articles, zhou et al. [14] apply the exact same method on the ≈30 million articles of the pubmed database. while they did not evaluate the effectiveness of their diseasesymptom relation extraction method, they published their relation scores which we will evaluate in this paper. -fulltext: applying the cooccur method only on mesh-keywords has two disadvantages: first, keywords are not available for all articles (e.g. only 30% of the ≈1.5 million pmc articles have keywords) and second, usually only the core topics of an article occur as keywords. 
we address these limitations by proposing an adaption of the cooccur method, in which we use the full text, the title, and the keywords of the ≈1.5 million pmc articles. specifically, we adapt the computation of the co-occurrence co(x, s), as follows: we first retrieve a set of relevant articles to a disease x, where an article is relevant if the disease exists in either the keyword, or the title section of the article. given these relevant articles and a symptom s, we compute the adapted co-occurrence co(x, s), which is the number of relevant articles in that the symptom occurs in the full text. the identification of the diseases in the title and symptoms in the full text is done using the metamap tool [1] . we now compare the disease-symptom extraction baselines on the proposed dsrcollection. the results for various evaluation metrics are shown in table 2 . the fulltext-variant of the cooccur method outperforms the other baselines on all evaluation metrics. this demonstrates the high effectiveness of our proposed adaption to the cooccur method. further, we see a clear advantage of the cooccur-method with meshkeywords from ≈30 million pubmed articles as the resource (kwdlarge) -in comparison to the same method with keywords from approximately 1.5 million pmc articles (kwd). this highlights the importance of the number of input samples to the method. error analysis: a common error source is a result of the fine granularity of the symptoms in the medical vocabularies. for example, the utilized mesh vocabulary contains the symptoms abdominal pain and abdomen, acute 4 . both symptoms can be found in the top ranks of the evaluated methods for the disease appendicitis (see table 3 ). however, since the corpus is not labeled on such a fine-grained level, the symptom abdomen, acute is counted as a false positive. another error source is a result of the bias in medical articles towards specific disease-symptom relationships. for instance, between the symptom obesity and periodontitis 5 a special relationship exists, which is the topic of various publications. despite obesity not being a characteristic symptom of a periodontitis, all methods return the symptom in the top-3 ranks. a promising research direction is the selective extraction of symptoms from biomedical literature by also considering the context (e.g. in a sentence) in that a disease/symptom appears. 
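as an illustration of the cooccur scoring described above, λ(x, s) = co(x, s) × isf(s), the following is a minimal sketch that assumes the disease and symptom mentions of each article have already been extracted; the data layout and names are ours, not the original implementation.

```python
from collections import defaultdict

def cooccur_scores(articles, diseases, symptoms):
    """articles: list of (disease_set, symptom_set) pairs, one per article."""
    co = defaultdict(int)          # raw co-occurrence counts co(x, s)
    ds_with_s = defaultdict(set)   # diseases that co-occur with each symptom
    for d_set, s_set in articles:
        for x in d_set & diseases:
            for s in s_set & symptoms:
                co[(x, s)] += 1
                ds_with_s[s].add(x)
    scores = {}
    for (x, s), c in co.items():
        isf = len(diseases) / len(ds_with_s[s])   # inverse symptom frequency
        scores[(x, s)] = c * isf                  # lambda(x, s) = co(x, s) * isf(s)
    return scores

articles = [({"appendicitis"}, {"abdominal pain", "nausea"}),
            ({"appendicitis"}, {"abdominal pain"}),
            ({"influenza"}, {"fever", "nausea"})]
print(cooccur_scores(articles, {"appendicitis", "influenza"},
                     {"abdominal pain", "nausea", "fever"}))
```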
effective mapping of biomedical text to the umls metathesaurus: the metamap program measuring nominal scale agreement among many raters extracting disease-symptom relationships by learning syntactic patterns from dependency graphs cumulated gain-based evaluation of ir techniques binary and graded relevance in ir evaluations -comparison of the effects on ranking of ir systems the measurement of observer agreement for categorical data symptom extraction issue distributed representations of words and phrases and their compositionality automated medical diagnosis by ranking clusters across the symptom-disease network neural networks for mining the associations between diseases and symptoms in clinical notes enhancing ontology-driven diagnostic reasoning with a symptom-dependency-aware naïve bayes classifier evaluating wikipedia as a source of information for disease understanding mining disease-symptom relation from massive biomedical literature and its application in severe disease diagnosis human symptoms-disease network we introduced the disease-s ymptom relation collection (dsr-collection) for the evaluation of graded disease-symptom relations. we provided baseline results for two recent methods, one based on word embeddings and the second on the cooccurrence of mesh-keywords of medical articles. we proposed an adaption to the co-occurrence method to make it applicable to the full text of medical articles and showed significant improvement of effectiveness over the other methods. key: cord-020931-fymgnv1g authors: meng, changping; chen, muhao; mao, jie; neville, jennifer title: readnet: a hierarchical transformer framework for web article readability analysis date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_3 sha: doc_id: 20931 cord_uid: fymgnv1g analyzing the readability of articles has been an important sociolinguistic task. addressing this task is necessary to the automatic recommendation of appropriate articles to readers with different comprehension abilities, and it further benefits education systems, web information systems, and digital libraries. current methods for assessing readability employ empirical measures or statistical learning techniques that are limited by their ability to characterize complex patterns such as article structures and semantic meanings of sentences. in this paper, we propose a new and comprehensive framework which uses a hierarchical self-attention model to analyze document readability. in this model, measurements of sentence-level difficulty are captured along with the semantic meanings of each sentence. additionally, the sentence-level features are incorporated to characterize the overall readability of an article with consideration of article structures. we evaluate our proposed approach on three widely-used benchmark datasets against several strong baseline approaches. experimental results show that our proposed method achieves the state-of-the-art performance on estimating the readability for various web articles and literature. readability is an important linguistic measurement that indicates how easily readers can comprehend a particular document. due to the explosion of web and digital information, there are often hundreds of articles describing the same topic, but vary in levels of readability. this can make it challenging for users to find the articles online that better suit their comprehension abilities. 
therefore, an automated approach to assessing readability is a critical component for the development of recommendation strategies for web information systems, including digital libraries and web encyclopedias. text readability is defined as the overall effect of language usage and composition on readers' ability to easily and quickly comprehend the document [14] . in this work, we focus on evaluating document difficulty based on the composition of words and sentences. consider the following two descriptions of the concept rainbow as an example. 1 . a more rigid scientific definition from english wikipedia: a rainbow is a meteorological phenomenon that is caused by reflection, refraction and dispersion of light in water droplets resulting in a spectrum of light appearing in the sky. a rainbow is an arc of color in the sky that can be seen when the sun shines through falling rain. the pattern of colors starts with red on the outside and changes through orange, yellow, green, blue, to violet on the inside. clearly, the first description provides more rigidly expressed contents, but is more sophisticated due to complicated sentence structures and the use of professional words. in contrast, the second description is simpler, with respect to both grammatical and document structures. from the reader's perspective, the first definition is more appropriate for technically sophisticated audiences, while the second one is suitable for general audiences, such as parents who want to explain rainbows to their young children. the goal of readability analysis is to provide a rating regarding the difficulty of an article for average readers. as the above example illustrates that, many approaches for automatically judging the difficulty of the articles are rooted in two factors: the difficulty of the words or phrases, and the complexity of syntax [11] . to characterize these factors, existing works [3, 29] mainly rely on some explicit features such as average syllables per word, average words per sentence, etc. for example, the flesch-kincaid index is a representative empirical measure defined as a linear combination of these factors [4] . some later approaches mainly focus on proposing new features with the latest cohmetrix 3.0 [36] providing 108 features, and they combine and use the features using either linear functions or statistical models such as support vector machines or multilayer perceptron [12, 40, 41, 43, 51] . while these approaches have shown some merits, they also lead to several drawbacks. specifically (1) they do not consider sequential and structural information, and (2) they do not capture sentences-level or documentlevel semantics that are latent but essential to the task [11] . to address these issues, we propose readnet, a comprehensive readability classification framework that uses a hierarchical transformer network. the selfattention portion of the transformer encoder is better able to model long-range and global dependencies among words. the hierarchical structure can capture how words form sentences, and how sentences form documents, meanwhile reduce the model complexity exponentially. moreover, explicit features indicating the readability of different granularities of text can be leveraged and aggregated from multiple levels of the model. we compare our proposed model to a number of widely-adopted document encoding techniques, as well as traditional readability analysis approaches based on explicit features. 
experimental results on three benchmark datasets show that our work properly identifies the document representation techniques, and achieves the state-of-the-art performance by significantly outperform previous approaches. existing computational methods for readability analysis [3, 11, 29, 40, 53] mainly use empirical measures on the symbolic aspects of the text, while ignoring the sequence of words and the structure of the article. the flesch-kincaid index [28] and related variations use a linear combination of explicit features. although models based on these traditional features are helpful to the quantification of readability for small and domain-specific groups of articles, they are far from generally applicable for a larger body of web articles [10, 17, 45] . because those features or formulas generated from a small number of training text specifically selected by domain experts, they are far from generally representing the readability of large collections of corpora. recent machine learning methods on readability evaluation are generally in the primitive stage. [18] proposes to combine language models and logistic regression. the existing way to integrate features is through a statistical learning method such as svm [12, 20, 40, 41, 43, 51] . these approaches ignore the sequential or structural information on how sentences construct articles. efforts have also been made to select optimal features from current hundreds of features [15] . some computational linguistic methods have been developed to extract higher-level language features. the widelyadopted coh-metrix [22, 37] provides multiple features based on cohesion such as referential cohesion and deep cohesion. plenty of works have been conducted on utilizing neural models for sentimental or topical document classification or ranking, while few have paid attention to the readability analysis task. the convolutional neural network (cnn) [27] is often adopted in sentence-level classification which leverages local semantic features of sentence composition that are provided by word representation approaches. in another line of approaches, a recursive neural network [46] is adopted, which focuses on modeling the sequence of words or sentences. hierarchical structures of such encoding techniques are proposed to capture structural information of articles, and have been widely used in tasks of document classification [7, 32, 48] , and sequence generation [30] and sub-article matching [6] . hierarchical attention network [52] is the current state-of-the-art method for document classification, which employs attention mechanisms on both word and sentence levels to capture the uneven contribution of different words and sentences to the overall meaning of the document. the transformer model [50] uses multihead self-attention to perform sequence-to-sequence translation. self-attention is also adopted in text summarization, entailment and representation [31, 38] . unlike topic and sentiment-related document classification tasks that focus on leveraging portions of lexemes that are significant to the overall meanings and sentiment of the document, readability analysis requires the aggregation of difficulty through all sentence components. besides, precisely capturing the readability of documents requires the model to incorporate comprehensive readabilityaware features, including difficulty, sequence and structure information, to the corresponding learning framework. 
in this section, we present the problem definition, as well as some representative explicit features that are empirically adopted for the readability analysis task. the readability analysis problem is defined as an ordinal regression problem for articles. given an article with up to n sentences and each sentence with up to m words, an article can be represented as a matrix a whose i-th row a_i,: corresponds to the i-th sentence, and a_i,j denotes the j-th word of the i-th sentence. given an article a, a label will be provided to indicate the readability of this article. we consider the examples introduced in sect. 1, where two articles describe the same term "rainbow". the first, rigorous scientific article can be classified as "difficult", and the second, general description can be classified as "easy". instead of classifying articles into binary labels like "easy" or "difficult", more fine-grained labels can help people better understand the levels of readability. for instance, we can map the articles to standardized systems of english tests such as the 5-level cambridge english exams (cee), where articles from the professional-level english exam (cpe) are regarded as more difficult than those from the introductory english exam (ket). previous works [11, 21, 22, 24, 25, 28, 34] have proposed empirical features to evaluate readability. correspondingly, we divide these features into sentence-level features and document-level features. sentence-level features seek to evaluate the difficulty of sentences. for instance, the sentence-level feature "number of words" for sentences can be averaged into "number of words per sentence" to evaluate the difficulty of documents. document-level features include the traditional readability indices and the cohesion features proposed by coh-metrix [22]. these features are listed in table 1, which also covers features such as the lexical diversity of content words (adjectives, nouns, verbs and adverbs) and the incidence of part-of-speech categories (adjectives, nouns, verbs, adverbs, pronouns) per 1000 words of text. current approaches [12, 41, 43] average the sentence-level features of each sentence to construct document-level features. furthermore, these features are concatenated with the document-level features, and an svm is used to learn on these features. the limitation lies in failing to capture the structure information of sentences and documents. for instance, in order to obtain the sentence-level features for the document, these approaches average the features of all sentences, ignoring how the sentences construct an article and which parts of the document more significantly decide the readability of the document. while the cohesion features provided by coh-metrix try to capture relationships between sentences, these features mainly depend on the repetition of words across multiple sentences. they do not directly model how sentences construct a document from the perspectives of structure and sequence. briefly speaking, existing works mainly contribute more features, as shown in table 1, but the current models used to aggregate these features are based on svms and linear models. in this work, we aim to propose a more advanced model to better combine these features with document information.
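as an illustration of the explicit sentence-level features discussed in this section, the sketch below computes words per sentence, syllables per word and a flesch-kincaid-style grade estimate; the syllable counter is a rough heuristic and the helper names are ours rather than the paper's.

```python
import re

def count_syllables(word):
    # crude heuristic: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def explicit_features(sentences):
    words = [w for s in sentences for w in s.split()]
    words_per_sent = len(words) / len(sentences)
    syll_per_word = sum(count_syllables(w) for w in words) / len(words)
    # Flesch-Kincaid grade level (standard published coefficients)
    fk_grade = 0.39 * words_per_sent + 11.8 * syll_per_word - 15.59
    return {"words_per_sentence": words_per_sent,
            "syllables_per_word": syll_per_word,
            "fk_grade": fk_grade}

doc = ["a rainbow is an arc of color in the sky.",
       "it can be seen when the sun shines through falling rain."]
print(explicit_features(doc))
```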
in order to address the limitations of traditional approaches, we propose read-net: the hierarchical transformer model for readability analysis as shown in the proposed model incorporates the explicit features with a hierarchical document encoder that encodes the sequence and structural information of an article. the first level of the hierarchical learning architecture models the formation of sentences from words. the second level models the formation of the article from sentences. the self-attention encoder (to be described in subsect. 4.1) is adapted from the vanilla transformer encoder [50] . the hierarchical structure, attention aggregation layer, combination with explicit features and transfer layer are specially designed for this readability analysis task. in this subsection, we introduce the encoding process of sentences in hierarchical mutli-head self-attention. the encoding process has three steps: (1) the selfattention encoder transforms the input sequence into a series of latent vectors; (2) the attention layer aggregates the encoded sequential information based on the induced significance of input units; (3) the encoded information is combined with the explicit features. transformer self-attention encoder. this encoder is adapted from the vanilla transformer encoder [50] . the input for this encoder is a i,: , which represents the i-th sentence. the embedding layer encodes each word a i,j into a d-dimensional vector based on word embedding. the output is a m × d-dimensional matrix b where d is the embedding dimension and m is the number of words. the position encoding layer indicates the relative position of each word a i,j . the elements of positional embedding matrix p where values in the i-th row j-th column is defined as follows. the embedded matrix b and positional embedding matrix p are added into the initial hidden state matrix h (0) = b + p . h (0) will go through a stack of p identical layers. each layer contains two parts: (i) the multi-head attention donated as function f mha defined in eq. 2, and (ii) the position-wise feed-forward f f f n defined in eq. 4. layer normalization is used to avoid gradient vanishing or explosion. multi-head self-attention function (f mha ) [50] encodes the relationship among query matrix q, key matrix k and value matrix v from different representation subspaces at different positions. position-wise feed-forward function f f f n [50] adopts two 1-dimensional convolution layers with kernel size 1 to encode input matrix x. for the l-th encoder layer, h (l) is encoded into h (l+1) according to eq. 5 attention aggregation layer. after p transformer encoder layers, each senwe first pass h (p) through a feed forward layer with d × d dimensional weights w 1 and bias term b 1 to obtain a hidden representation as u : then compute the similarity between u and the trainable d × 1 dimensional context matrix c via w = softmax(u c), which we use as importance weights to obtain the final embedding of the sentence a i,: : combination of explicit features. the sentence level features u i introduced in sect. 3.2 table 1 for i-th sentence are concatenated by h * i = h i ⊕ u i . the second level of the hierarchical learning architecture is on top of the first layer. n encoded vector h * i (1 ≤ i ≤ n) are concatenated as the input for this layer. the structure of second level is the same as the first level. the output of this level is a vector y as the overall embedding of this article. 
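the attention aggregation step described above can be sketched in pytorch as follows; this is a simplified illustration (a tanh non-linearity is assumed in the feed-forward projection, which the text does not specify), not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AttentionAggregation(nn.Module):
    """Collapse a sequence of d-dimensional hidden states into one d-dim vector."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)                      # W1, b1
        self.context = nn.Parameter(torch.randn(d, 1))   # trainable context vector c

    def forward(self, H):                                # H: (seq_len, d)
        U = torch.tanh(self.proj(H))                     # hidden representation u
        w = torch.softmax(U @ self.context, dim=0)       # importance weights
        return (w * H).sum(dim=0)                        # weighted sum -> (d,)

H = torch.randn(50, 512)                  # e.g. 50 word states of one sentence
sentence_vec = AttentionAggregation(512)(H)
print(sentence_vec.shape)                 # torch.Size([512])
```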
the goal of the transfer layer is to improve prediction quality on a target task where training data are scarce, while a large amount of other training data are available for a set of related tasks. the readability analysis problem suffers from the lack of labeled data. traditional benchmark datasets labeled by domain experts typically contain a small number of articles. for instance, cee contains 800 articles and weebit contains around 8 thousand articles. such quantities of articles are far smaller than those for sentiment or topic-related document classification tasks which typically involve over ten thousand articles even for binary classification [7, 27] . on the other hand, with the emerging of online encyclopedia applications such as wikipedia, it provides a huge amount of training dataset. for instance, english wikipedia and simple-english wikipedia contain more than 100 thousand articles which can be used to train a deep learning model. one fully connected layer combines the article embedding vector y and document-level features v from table 1 to output the readability label vector r after a softmax function. w t is the weight of the fully connected layer. for dataset with m categories of readability ratings, each document is embedded into r with m − 1 dimensions. if transfer learning is needed, instead of random initialization, this network is initialized with a pre-trained network based on a larger corpus. during the training process, update the transfer layer while keeping all other layers frozen. if transfer learning is not needed, all layers are updated during the training process. given dataset with m categories of readability ratings, the goal is to minimize ordinal regression loss [42] defined as eq. 7. r k represents the k-th dimension of the r vector. y is the true label. the threshold parameter θ 1 , θ 2 , . . . θ m−1 are also learned automatically from the data. here, the objective of learning the readability analysis model is essentially different from that of a regular document classification model, since the classes here do form a partial-order. however, the case of two classes degenerates the learning to the same as that of a binary classifier. for self-attention, the path length in the computation graph between long-range dependencies in the network is o(1) instead of o(n) for recurrent models such as lstm. shorter path length in the computation graph makes it easier to learn the interactions between any elements in the sequence. for readability analysis, modeling the overall interaction between words is more important than modeling the consequent words. for semantic understanding, the consequence of two words such as "very good" and "not good" make distinct semantic meanings. while for readability analysis, it does not make difference in difficulty to understand it. the overall evaluation of the words difficulties in the sentences matters. the hierarchical learning structure benefits in two ways. first, it mimics human reading behaviors, since the sentence is a reasonable unit for people to read, process and understand. people rarely check the interactions between arbitrary words across different sentences in order to understand the article. second, the hierarchical structure can reduce parameter complexity. for a document with n sentences, m words per sentence, d dimension per word, the parameter complexity of the model is o((nm) 2 d) for single level structure. while for the hierarchical structure, the parameter complexity is o(m 2 d + n 2 d). 
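since eq. 7 is not reproduced in the text, the sketch below shows one common threshold-based ordinal regression loss in the spirit of [42], with learnable thresholds θ_1, ..., θ_(m−1) and one score per threshold; it is an illustrative stand-in under these assumptions rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def ordinal_threshold_loss(r, y, theta):
    """
    r:     (batch, m-1) model outputs, one score per threshold
    y:     (batch,) true labels in {0, ..., m-1}
    theta: (m-1,) learnable thresholds, ideally non-decreasing
    All-threshold logistic loss: each threshold k should lie above the
    corresponding score when y <= k and below it when y > k.
    """
    k = torch.arange(theta.numel(), device=r.device).unsqueeze(0)  # (1, m-1)
    sign = torch.where(y.unsqueeze(1) > k, 1.0, -1.0)
    return F.softplus(-sign * (r - theta)).sum(dim=1).mean()

r = torch.randn(4, 4, requires_grad=True)       # m = 5 readability levels
y = torch.tensor([0, 2, 3, 4])
theta = torch.zeros(4, requires_grad=True)
print(ordinal_threshold_loss(r, y, theta))
```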
in this section, we present the experimental evaluation of the proposed approach. we first introduce the datasets used for the experiments, followed by the comparison of the proposed approach and baselines based on held-out evaluation, as well as detailed ablation analysis of different techniques enabled by our approach. we use the following three datasets in our experiment. table 2 reports the statistics of the three datasets including the average number of sentences per article n sent and the average number of words per sentence n word . wiki dataset [26] contains english wikipedia and simple english wikipedia. simple english wikipedia thereof is a simplified version of english wikipedia which only uses simple english words and grammars. this dataset contains 59,775 english wikipedia articles and 59,775 corresponding simple english wikipedia articles. cambridge english exam (cee) [51] categorizes articles based on the criteria of five cambridge english exam level (ket, pet, fce, cae, cpe). the five ratings are sequentially from the easiest ket to the hardest cpe. in total, it contains 110 ket articles, 107 pet articles, 153 fce articles, 263 cae articles and 155 cpe articles. even though this dataset designed for non-native speakers may differ from materials for native english speakers, the difficulty between five levels is still comparable. we test our model on this dataset in order to check whether our model can effectively evaluate the difficulty of english articles according to an existing standard. weebit [49] is one of the largest dataset for readability analysis. it contains 7,676 articles targeted at different age group readers from weekly reader magazine and bbc-bitesize website. weekly reader magazine categorizes articles according to the ages of targeted readers in 7-8, 8-9 and 9-10 years old. bbc-bitesize has two levels for age 11-14 and 15-16. the targeted age is used to evaluate readability levels. table 3 . cross-validation classification accuracy and standard deviation (in parentheses) on wikipedia (wiki), cambridge english exam (cee) and weebit dataset. we report accuracy on three groups of models: (1) statistical classification algorithms including multi-class logistic regression, linear svm and multilayer perceptron (mlp); (2) three types of document classifier cnn, hierarchical grnn using lstm cells (lstm), hierarchical attention network (hatt); (3) hierarchical attention network combined with explicit features (hatt+), and our proposed approach which combines explicit features and semantics with hierarchical self-attention (readnet). transfer learning is not used, and all parameters in the model are initialized randomly (transfer learning is evaluated separately in table 5 ). in this subsection, we provide a detailed evaluation of the proposed approach. baseline approaches. we compare our proposed approach (denoted readnet) against the following baseline methods. -statistical classification algorithms based on explicit features: this category of baselines including the statistical classification algorithms that are widely adopted in a line of previous works [12, 20, 40, 41, 43, 51] , such as multiclass logistic regression, the linear svm, and the multilayer perceptron (mlp) [49] . explicit features on which these models are trained have been introduced in sect. 3 [27] , the hierarchical gated neural network with long short-term memory (lstm) [48] , and the hierarchical attention network (hatt) [52] . 
-the hierarchical attention network combined with explicit features (hatt+), for which we use the same mechanism as our proposed approach to incorporate the explicit features into the representation of each sentence by the attentive rnn. model configurations. for article encoding, we limit the number of sentences of each article to up to 50, zero-pad short ones and truncate over-length ones. according to the data statistics in table 2 , 50 sentences are enough to capture the majority of information of articles in the datasets. for each sentence, we also normalize the number of words to be fed into the model as 50, also via zero-padding and truncating. we fix the batch size to 32, and use adam [16] as the optimizer with a learning rate 0.001. the epochs of training for the neural models are limited to 300. we set the number of encoder layers p and q to 6. the embedding dimension d = 100. number of heads h in f mha is 3. cnn adopts the same configuration as [27] . other statistical classification algorithms are trained until converge. source code will be available in the final version. evaluation protocol. we formalize the task as a classification task following previous works on the three benchmark datasets. in order to provide a valid quantitative evaluation, we have to follow the existing evaluation method to show the advantage of our proposed model compared with the baselines. we adopt 5-fold cross-validation to evaluate the proposed model and baselines. we report the classification accuracy that is aggregated on all folds of validation. the results are reported in table 3 . traditional explicit features can provide satisfying results. since the multi-class logistic regression, svm and mlp models can combine the features number of words per sentence and number of syllabi per word which are included in flesch-kincaid score, they provide the reasonable result. cnn is only slightly better than random guess. we assume that this is because cnn does not capture the sequential and structural information of documents. the hatt approach provides the best among models without explicit features. the reasons root in the structure of the model which is able to capture length and structural information of the article. since it also adopted a hierarchical structure, the conciseness of each sentence and that of the overall article structure is captured, which appears to be significant to the task. the explicit features further improve the results of hatt as shown by hatt+. even without explicit features, our proposed approach is better than hatt+. hatt has appeared to be successful at highlighting some lexemes and sentence components that are significant to the overall meanings or sentiment of a document. however, unlike topic and sentiment-related document classification tasks, readability does not rely on several consecutive lexemes, but the aggregation of all sentence components. the path length in the computation graph between arbitrary components dependencies in readnet is o(1) instead of o(n) for hatt. shorter path length in the computation graph makes it easier to learn the interactions between any arbitrary words in sentence level, or sentences in document-level. compared with traditional approaches, the main advantage of the proposed approach is that it uses the document encoder to learn how words are connected into sentences and how sentences are connected into documents. baseline approaches only use the averaged explicit features of all the sentences. 
for these datasets, several extremely difficult and complicated sentences usually determine the readability of a document. this useful information is averaged and weakened by the total number of sentences in baselines. as shown in table 3 , the standard deviation of the cee task is large compared with those in wiki and weebit tasks since the quantity of cee articles is not enough to train a complex deep learning model. transfer layer in readnet is utilized in three steps. first is to train and save the model from larger datasets such as wiki or weebit. then, we initialize the model for cee task and load the parameter weights from the saved model except for the transfer layer. eventually on the target task, the transfer layer is trained while keeping all other layers fixed. as shown in table 5 , loading a pre-trained model based on weebit or wiki can increase the accuracy and decrease standard deviation on the cee task. it is shown that a more accurate and stable model can be achieved by utilizing the transfer layer and well-trained models from related tasks. table 5 . accuracy for cee classification using the transfer layer. original is the model not using transfer learning, and without loading trained weights from other dataset. load weebit is to load the parameters weights trained in weebit except the transfer layer. load wiki is to load the parameters weights trained in wiki except the transfer layer. load weebit load wiki besides directly training and evaluating the same dataset, we also tried the model trained using wikipedia dataset and evaluate on cambridge english dataset. 10 articles are randomly selected from each level of cambridge english test. the probability of being classified as regular english wikipedia instead of simple english wikipedia is treated as the difficulty score. the average difficulty scores predicted by the model are shown in table 4 , which shows that our produced readability score implies correctly the difficulty of english documents for different levels of exams. a larger score indicates higher difficulty. these scores correctly indicate the difficulty levels of these exams. we have proposed a model to evaluate the readability of articles which can make great contributions to a variety of applications. our proposed hierarchical self-attention framework outperforms existing approaches by combining hierarchical document encoders with the explicit features proposed by linguistics. for future works, we are interested in providing the personalized recommendation of articles based on the combination of article readability and the understanding ability of the user. currently, readability of articles only evaluate the texts of articles, other modalities such as images [39] and taxonomies [8] considered to improve readers' understanding. more comprehensive document encoders such as rcnn [5] and tree lstm [47] may also be considered. 
lix and rix: variations on a little-known readability index student, text and curriculum modeling for reader-specific document retrieval readability: an appraisal of research and application readability revisited: the new dale-chall readability formula multifaceted protein-protein interaction prediction based on siamese residual rcnn neural article pair modeling for wikipedia sub-article matching learning to differentiate between main-articles and sub-articles in wikipedia on2vec: embedding-based relation prediction for ontology population a computer readability formula designed for machine scoring a language modeling approach to predicting reading difficulty computational assessment of text readability: a survey of current and future research predicting reading difficulty with statistical language models a new academic word list the concept of readability all mixed up? finding the optimal feature set for general readability prediction and its application to english and dutch adaptive subgradient methods for online learning and stochastic optimization cognitively motivated features for readability assessment combining a statistical language model with logistic regression to predict the lexical and syntactic difficulty of texts for ffl a readability formula that saves time the varied uses of readability measurement today linguistic complexity: locality of syntactic dependencies coh-metrix: analysis of text on cohesion and language the fog index after twenty years an analysis of statistical models and features for reading difficulty prediction combining lexical and grammatical features to improve readability measures for first and second language texts improving text simplification language modeling using un simplified text data convolutional neural networks for sentence classification derivation of new readability formulas for navy enlisted personnel the measurement of readability: useful information for communicators a hierarchical neural autoencoder for paragraphs and documents hierarchical attention transfer network for cross-domain sentiment classification hierarchical recurrent neural network for document modeling an analytic and cognitive parametrization of coherence relations measures of lexical richness smog grading-a new readability formula automated evaluation of text and discourse with coh-metrix coh-metrix: capturing linguistic features of cohesion a decomposable attention model for natural language inference embedding multimodal relational data for knowledge base completion predicting proficiency levels in learner writings by transferring a linguistic complexity model from expert-written coursebooks revisiting readability: a unified framework for predicting text quality loss functions for preference levels: regression with discrete ordered labels reading level assessment using support vector machines and statistical language models automated readability index a statistical model for scientific readability recursive deep models for semantic compositionality over a sentiment treebank improved semantic representations from treestructured long short-term memory networks document modeling with gated recurrent neural network for sentiment classification on improving the accuracy of readability classification using insights from second language acquisition attention is all you need text readability assessment for second language learners hierarchical attention networks for document classification readability: its past, present, and future key: cord-020835-n9v5ln2i authors: jangra, 
anubhav; jatowt, adam; hasanuzzaman, mohammad; saha, sriparna title: text-image-video summary generation using joint integer linear programming date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_24 sha: doc_id: 20835 cord_uid: n9v5ln2i automatically generating a summary for asynchronous data can help users to keep up with the rapid growth of multi-modal information on the internet. however, the current multi-modal systems usually generate summaries composed of text and images. in this paper, we propose a novel research problem of text-image-video summary generation (tivs). we first develop a multi-modal dataset containing text documents, images and videos. we then propose a novel joint integer linear programming multi-modal summarization (jilp-mms) framework. we report the performance of our model on the developed dataset. advancement in technology has led to rapid growth of multimedia data on the internet, which prevent users from obtaining important information efficiently. summarization can help tackle this problem by distilling the most significant information from the plethora of available content. recent research in summarization [2, 11, 31] has proven that having multi-modal data can improve the quality of summary in comparison to uni-modal summaries. multi-modal information can help users gain deeper insights. including supportive representation of text can reach out to a larger set of people including those who have reading disabilities, users who have less proficiency in the language of text and skilled readers who are looking to skim the information quickly [26] . although visual representation of information is more expressive and comprehensive in comparison to textual description of the same information, it is still not a thorough model of representation. encoding abstract concepts like guilt or freedom [11] , geographical locations or environmental features like temperature, humidity etc. via images is impractical. also images are a static medium and cannot represent dynamic and sequential information efficiently. including videos could then help overcome these barriers since video contains both visual and verbal information. to the best of our knowledge, all the previous works have focused on creating text or text-image summaries, and the task of generating an extractive multimodal output containing text, images and videos from a multi-modal input has not been done before. we thus focus on a novel research problem of text-imagevideo summary generation (tivs). to tackle the tivs task, we design a novel integer linear programming (ilp) framework that extracts the most relevant information from the multimodal input. we set up three objectives for this task, (1) salience within modality, (2) diversity within modality and (3) correspondence across modalities. for preprocessing the input, we convert the audio into text using an automatic speech recognition (asr) system, and we extract the key-frames from video. the most relevant images and videos are then selected in accordance with the output generated by our ilp model. to sum up, we make the following contributions: (1) we present a novel multimodal summarization task which takes news with images and videos as input, and outputs text, images and video as summary. (2) we create an extension of the multi-modal summarization dataset [12] by constructing multi-modal references containing text, images and video for each topic. 
(3) we design a joint ilp framework to address the proposed multi-modal summarization task. text summarization techniques are used to extract important information from textual data. a lot of research has been done in the area of extractive [10, 21] and abstractive [3, 4, 19, 23] summarization. various techniques like graph-based methods [6, 15, 16] , artificial neural networks [22] and deep learning based approaches [18, 20, 29] have been developed for text summarization. integer linear programming (ilp) has also shown promising results in extractive document summarization [1, 9] . duan et al. [5] proposed a joint-ilp framework that produces summaries from temporally separate text documents. recent years have shown great promise in the emerging field of multi-modal summarization. multi-modal summarization has various applications ranging from meeting recordings summarization [7] , sports video summarization [25] , movie summarization [8] to tutorial summarization [13] . video summarization [17, 28, 30] is also a major sub-domain of multi-modal summarization. a few deep learning frameworks [2, 11, 31] show promising results, too. li et al. [12] uses an asynchronous dataset containing text, images and videos to generate a textual summary. although some work on document summarization has been done using ilp, to the best of our knowledge no one has ever used an ilp framework in the area of multi-modal summarization. our objective is to generate a multimodal summary s = x sum i sum v sum such that the final summary s covers up all the important information in the original data while minimizing the length of summary, where each topic in our dataset comprises of text documents, images, audio and videos. as shown in fig. 1 , we firstlextract key-frames from the videos [32] . these keyframes together with images from the original data form the image-set. the audio is transcribed into text (ibm watson speech-to-text service: www.ibm.com/ watson/developercloud/speech-to-text.html), which contributes to the text-set together with the sentences from text-documents. the images from then imageset are encoded by the vgg model [24] and the 4,096-dimensional vector from the pre-softmax layer is used as the image representation. every sentence from the text-set is encoded using the hybrid gaussian-laplacian mixture model (hglmm) into a 6,000-dimensional vector. for text-image matching, these image and sentence vectors are fed into a two-branch neural network [27] to have a 512-dimensional vector for images and sentences in a shared space. ilp is a global optimization technique, used to maximize or minimize an objective function subject to some constraints. in this paper, we propose a joint-ilp technique to optimize the output to have high salience, diversity and crossmodal correlation. the idea of joint-ilp is similar to the one applied in the field of across-time comparative summarization [5] . however, to the best of our knowledge, an ilp framework was not used to solve multi-modal summarization (gurobi optimizer is used for ilp optimization: https://www.gurobi.com/). decision variables. m txt is a n × n binary matrix such that m txt i,i indicates whether sentence s i is selected as an exemplar or not and m txt i,j =i indicates whether sentence s i votes for s j as its representative. similarly, m img is a p × p binary matrix that indicates the exemplars chosen in the image set. m c is n × p binary matrix that indicates the cross-modal correlation. 
m c i,j is true when there is some correlation between sentence s i and image i j . where mod, t, item ∈ { text, n, s , img, p, i } is used to represent multiple modalities together in a simple way. we need to maximize the objective function in eq. 1, containing salience of text, images and cross-modal correlation. similar to the joint-ilp formulation in [5] the diversity objective is implicit in this model. equation 4 generates the set of entities that are a part of the cluster whose exemplar is item i . the salience is calculated by eqs. 2 and 3 by taking cosine similarity over all the exemplars with the items belonging to their representative clusters separately for each modality. the cross-modal correlation score is calculated in eq. 5. equation 7 ensures that exactly k txt and k img clusters are formed in their respective uni-modal vector space. equation 8 guarantees that an entity can either be an exemplar or be part of a single cluster. according to eq. 9, a sentence or image must be exemplar in their respective vector space to be included in the sentence-image summary pairs. values of m, k txt and k img are set to be 10, same as in [5] . the joint-ilp framework outputs the text summary (x sum ) and top-m images from the image-set. this output is used to prepare the image and video summary. equation 11 selects all those images from top10 images that are not keyframes. assuming that images which look similar would have similar annotation scores and would help users gain more insight, the images relevant to the images in i sum1 (at least with α cosine similarity) but not too similar (at max with β cosine similarity) to avoid redundancy are also selected to be a part of the final image summary i sum (eq. 12). α is set to 0.4 and β is 0.8 in our experiments. extracting video. for each video, weighted sum of visual (eq. 13) and verbal (eq. 14) scores is computed. the video with the highest score is selected as our video summary. where kf is the set of all key-frames and st is the set of speech transcriptions. there is no benchmark dataset for the tivs task. therefore, we created our own text-image-video dataset by extending and manually annotating the multi-modal summarization dataset introduced by li et al. [12] . their dataset comprised of 25 new topics. each topic was composed of 20 text documents, 3 to 9 images, and 3 to 8 videos. the final summary however was unimodal, that is, in the form of only a textual summary containing around 300 words. we then extended it by selecting some images and a video for each topic that summarize the topic well. three undergraduate students were employed to score the images and videos with respect to the benchmark text references. all annotators scored each image and video on a scale of 1 to 5, on the basis of similarity between the image/video and the text references (1 indicating no similarity and 5 denoting the highest level of similarity). average annotation scores (aas) were calculated for each image and video. the value of the minimum average annotation score for images was kept as a hyper-parameter to evaluate the performance of our model in various settings 2 . the video with the highest score is chosen to be the video component of the multi-modal summary 3 . we evaluate the performance of our model using the dataset as described above. we use the rouge scores [14] to evaluate the textual summary, and based on them we compare our results with the ones of three baselines. we use the multi-document summarization model proposed in [1] . 
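to make the joint-ilp formulation above concrete, the following is a minimal, schematic sketch of the exemplar-selection program, not the authors' implementation: it assumes precomputed cosine-similarity matrices, weights the three objective terms equally (the exact weighting used in eq. 1 is not reproduced in the text), and uses the open-source pulp/cbc solver in place of gurobi.

```python
# schematic sketch of joint-ILP exemplar selection (assumptions noted above)
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, PULP_CBC_CMD

def joint_ilp_summary(sim_txt, sim_img, sim_cross, k_txt=10, k_img=10):
    """sim_txt: n x n sentence-sentence cosine similarities,
    sim_img: p x p image-image similarities,
    sim_cross: n x p sentence-image similarities in the 512-d joint space."""
    n, p = len(sim_txt), len(sim_img)
    prob = LpProblem("joint_multimodal_summarization", LpMaximize)

    # m_txt[i][j] = 1 if sentence i votes for sentence j as its exemplar (i == j marks an exemplar)
    m_txt = [[LpVariable(f"mt_{i}_{j}", cat=LpBinary) for j in range(n)] for i in range(n)]
    m_img = [[LpVariable(f"mi_{i}_{j}", cat=LpBinary) for j in range(p)] for i in range(p)]
    # m_c[i][j] = 1 if sentence exemplar i is paired with image exemplar j
    m_c = [[LpVariable(f"mc_{i}_{j}", cat=LpBinary) for j in range(p)] for i in range(n)]

    # objective: intra-modal salience of both modalities plus cross-modal correlation
    prob += (lpSum(sim_txt[i][j] * m_txt[i][j] for i in range(n) for j in range(n))
             + lpSum(sim_img[i][j] * m_img[i][j] for i in range(p) for j in range(p))
             + lpSum(sim_cross[i][j] * m_c[i][j] for i in range(n) for j in range(p)))

    for m, size, k in ((m_txt, n, k_txt), (m_img, p, k_img)):
        prob += lpSum(m[i][i] for i in range(size)) == k          # exactly k exemplars per modality
        for i in range(size):
            prob += lpSum(m[i][j] for j in range(size)) == 1      # exemplar, or member of exactly one cluster
            for j in range(size):
                prob += m[i][j] <= m[j][j]                        # may only vote for a chosen exemplar

    for i in range(n):                                            # cross-modal pairs may only link exemplars
        for j in range(p):
            prob += m_c[i][j] <= m_txt[i][i]
            prob += m_c[i][j] <= m_img[j][j]

    prob.solve(PULP_CBC_CMD(msg=False))
    sentences = [i for i in range(n) if m_txt[i][i].value() > 0.5]
    images = [j for j in range(p) if m_img[j][j].value() > 0.5]
    return sentences, images
```

in this sketch the selected sentence indices form the textual summary and the selected image exemplars play the role of the top-m images that are subsequently filtered with the α/β similarity thresholds described above; the baselines discussed next all build on the multi-document summarization model of [1] and are compared against this ilp output.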
for baseline-1 we feed the model with embedded sentences from all the original documents together. the central vector is calculated as the average of all the sentence vectors. the model is given vectors for sentences from the text-set and images from the image-set in the joint space for other baselines. for baseline-2, the average of all the vectors is taken as the central vector. for baseline-3, the central vector is calculated as the weighted average of all the sentence and image vectors. we give equal weights to text, speech and images for simplicity. as shown in table 1 , our model produces better results than the prepared baselines in terms of rouge-2 and rouge-l scores. table 2 shows the average precision and recall scores as well as the variance. we set various threshold values for the annotation scores to generate multiple image test sets in order to evaluate the performance of our model. we get a higher precision score for low aas value, because the number of images in the final solution increases on decreasing the threshold values. the proposed model gave 44% accuracy in extracting the most appropriate video (whereas random selection of images for 10 different iterations gives an average 16% accuracy). unlike other problems that focus on text-image summarization, we propose to generate a truly multi-modal summary comprising of text, images and video. we also develop a dataset for this task, and propose a novel joint ilp framework to tackle this problem. multi-document summarization model based on integer linear programming abstractive text-image summarization using multi-modal attentional hierarchical rnn fast abstractive summarization with reinforce-selected sentence rewriting abstractive sentence summarization with attentive recurrent neural networks across-time comparative summarization of news articles lexrank: graph-based lexical centrality as salience in text summarization multimodal summarization of meeting recordings multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention extractive multi-document summarization with integer linear programming and support vector regression a trainable document summarizer multi-modal sentence summarization with modality attention and image filtering multi-modal summarization for asynchronous collection of text, image multimodal abstractive summarization for open-domain videos rouge: a package for automatic evaluation of summaries graph-based ranking algorithms for sentence extraction, applied to text summarization textrank: bringing order into text streaming non-monotone submodular maximization: personalized video summarization on the fly summarunner: a recurrent neural network based sequence model for extractive summarization of documents abstractive text summarization using sequence-to-sequence rnns and beyond classify or select: neural architectures for extractive document summarization constructing literature abstracts by computer: techniques and prospects extractive single document summarization using multi-objective optimization: exploring self-organized differential evolution, grey wolf optimizer and water cycle algorithm. 
knowl.-based syst get to the point: summarization with pointergenerator networks very deep convolutional networks for large-scale image recognition multi-modal summarization of key events and top players in sports tournament videos multimodal summarization of complex sentences learning deep structure-preserving image-text embeddings video summarization via semantic attended networks multiview convolutional neural networks for multidocument extractive summarization deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward msmo: multimodal summarization with multimodal output adaptive key frame extraction using unsupervised clustering key: cord-020794-d3oru1w5 authors: leekha, maitree; goswami, mononito; jain, minni title: a multi-task approach to open domain suggestion mining using language model for text over-sampling date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_28 sha: doc_id: 20794 cord_uid: d3oru1w5 consumer reviews online may contain suggestions useful for improving commercial products and services. mining suggestions is challenging due to the absence of large labeled and balanced datasets. furthermore, most prior studies attempting to mine suggestions, have focused on a single domain such as hotel or travel only. in this work, we introduce a novel over-sampling technique to address the problem of class imbalance, and propose a multi-task deep learning approach for mining suggestions from multiple domains. experimental results on a publicly available dataset show that our over-sampling technique, coupled with the multi-task framework outperforms state-of-the-art open domain suggestion mining models in terms of the f-1 measure and auc. consumers often express their opinions towards products and services through online reviews and discussion forums. these reviews may include useful suggestions that can help companies better understand consumer needs and improve their products and services. however, manually mining suggestions amid vast numbers of non-suggestions can be cumbersome, and equated to finding needles in a haystack. therefore, designing systems that can automatically mine suggestions is essential. the recent semeval [6] challenge on suggestion mining saw many researchers using different techniques to tackle the domain-specific task (in-domain suggestion mining). however, open-domain suggestion mining, which obviates the need for developing separate suggestion mining systems for different domains, is still an emerging research problem. we formally define the problem of open-domain suggestion mining as follows: building on the work of [5] , we design a framework to detect suggestions from multiple domains. we formulate a multitask classification problem to identify both the domain and nature (suggestion or non-suggestion) of reviews. furthermore, we also propose a novel language model-based text over-sampling approach to address the class imbalance problem. we use the first publicly available and annotated dataset for suggestion mining from multiple domains created by [5] . it comprises of reviews from four domains namely, hotel, electronics, travel and software. during pre-processing, we remove all urls (eg. https:// ...) and punctuation marks, convert the reviews to lower case and lemmatize them. we also pad the text with start s and end e symbols for over-sampling. one of the major challenges in mining suggestions is the imbalanced distribution of classes, i.e. 
the number of non-suggestions greatly outweigh the number of suggestions (refer table 1 ). to this end, studies frequently utilize synthetic minority over-sampling technique (smote) [1] to over-sample the minority class samples using the text embeddings as features. however, smote works in table 1 . datasets and their sources used in our study [5] . the class ratio column highlights the extent of class imbalance in the datasets. the travel datasets have lower inter-annotator agreement than the rest, indicating that they may contain confusing reviews which are hard to confidently classify as suggestions or non-suggestions. this also reflects in our classification results. the euclidean space and therefore does not allow an intuitive understanding and representation of the over-sampled data, which is essential for qualitative and error analysis of the classification models. we introduce a novel over-sampling technique, language model-based over-sampling technique (lmote), exclusively for text data and note comparable (and even slightly better sometimes) performance to smote. we use lmote to over-sample the number of suggestions before training our classification model. for each domain, lmote uses the following procedure to over-sample suggestions: find top η n-grams: from all reviews labelled as suggestions (positive samples), sample the top η = 100 most frequently occurring n-grams (n = 5). for example, the phrase "nice to be able to" occurred frequently in many domains. train a bilstm language model on the positive samples (suggestions). the bilstm model predicts the probability distribution of the next word (w t ) over the whole vocabulary (v ∪ e) based on the last n = 5 words (w t−5 , . . . , w t−1 ), i.e., the model learns to predict the probability distribution n-grams: using the language model and a randomly chosen frequent 5-gram as the seed, we generate text by repeatedly predicting the most probable next word (w t ), until the end symbol e is predicted. table 2 comprises of the most frequent 5-grams and their corresponding suggestions 'sampled' using lmote. in our study, we generate synthetic positive reviews till the number of suggestion and non-suggestion class samples becomes equal in the training set. seed ← random(n grams) 6: sample ← lmotegenerate(language model, seed) 7: s ← s ∪ sample 8: end while 9: return s algorithm 1 summarizes the lmote over-sampling methodology. following is a brief description of the sub-procedures used in the algorithm: • lmotegenerate(language model, seed): the procedure takes as input the trained language model and a randomly chosen n-gram from the set of top η n-grams as seed, and starts generating a review till the end tag, e is produced. the procedure is repeated until we have a total of n suggestion reviews. multi-task learning (mtl) has been successful in many applications of machine learning since sharing representations between auxiliary tasks allows models to generalize better on the primary task. figure 1b illustrates 3-dimensional umap [4] visualization of text embeddings of suggestions, coloured by their domain. these embeddings are outputs of the penultimate layer (dense layer before the final softmax layer) of the single task (stl) ensemble baseline. it can be clearly seen that suggestions from different domains may have varying feature representations. therefore, we hypothesize that we can identify suggestions better by leveraging domain-specific information using mtl. 
therefore, in the mtl setting, given a review r i in the dataset, d, we aim to identify both the domain of the review, as well as its nature. we use an ensemble of three architectures namely, cnn [2] to mirror the spatial perspective and preserve the n-gram representations; attention network to learn the most important features automatically; and a bilstm-based text rcnn [3] model to capture the context of a text sequence (fig. 2) . in the mtl setting, the ensemble has two output softmax layers, to predict the domain and nature of a review. the stl baselines on the contrary, only have a singe softmax layer to predict the nature of the review. we use elmo [7] word embeddings trained on the dataset, as input to the models. we conducted experiments to assess the impact of over-sampling, the performance of lmote and the multi-task model. we used the same train-test split as provided in the dataset for our experiments. all comparisons have been made in terms of the f-1 score of the suggestion class for a fair comparison with prior work on representational learning for open domain suggestion mining [5] (refer baseline in table 3 ). for a more insightful evaluation, we also compute the area under receiver operating characteristic (roc) curves for all models used in this work. tables 3, 4 over-sampling improves performance. to examine the impact of oversampling, we compared the performance of our ensemble classifier with and without over-sampling i.e. we compared results under the stl, stl + smote and stl + lmote columns. our results confirm that in general, over-sampling suggestions to obtain a balanced dataset improves the performance (f-1 score & auc) of our classifiers. we compared the performance of smote and lmote in the single task settings (stl + smote and stl + lmote ) and found that lmote performs comparably to smote (and even outperforms it in the electronics and software domains). lmote also has the added advantage of resulting in intelligible samples which can be used to qualitatively analyze and troubleshoot deep learning based systems. for instance, consider suggestions created by lmote in table 2 . while the suggestions may not be grammatically correct, their constituent phrases are nevertheless semantically sensible. multi-task learning outperforms single-task learning. we compared the performance of our classifier in single and multi-task settings (stl + lmote and mtl + lmote ) and found that by multi-task learning improves the performance of our classifier. we qualitatively analysed the single and multi task models, and found many instances where by leveraging domain-specific information the multi task model was able to accurately identify suggestions. for instance, consider the following review: "bring a lan cable and charger for your laptop because house-keeping doesn't provide it." while the review appears to be an assertion (non-suggestion), by predicting its domain (hotel), the multitask model was able to accurately classify it as a suggestion. in this work, we proposed a multi-task learning framework for open domain suggestion mining along with a novel language model based over-sampling technique for text-lmote. our experiments revealed that multi-task learning combined with lmote over-sampling outperformed considered alternatives in terms of both the f1-score of the suggestion class and auc. 
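as a concrete illustration of the lmote procedure described above, the following is a minimal sketch, not the authors' implementation: `lm_next_word` stands in for the trained bilstm language model and is assumed to return a word-to-probability dictionary for the next position given the last n = 5 tokens, and the padding symbols are written here as "<s>" and "<e>".

```python
# minimal sketch of LMOTE-style over-sampling (assumptions noted above)
import random
from collections import Counter

def top_ngrams(suggestions, n=5, eta=100):
    """most frequent n-grams over the positive (suggestion) reviews."""
    counts = Counter()
    for tokens in suggestions:                           # each review is a list of tokens
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [list(gram) for gram, _ in counts.most_common(eta)]

def lmote_generate(lm_next_word, seed, n=5, end_symbol="<e>", max_len=60):
    """greedily extend a seed n-gram until the end symbol is produced."""
    tokens = list(seed)
    while tokens[-1] != end_symbol and len(tokens) < max_len:
        dist = lm_next_word(tokens[-n:])                 # dict: word -> probability
        tokens.append(max(dist, key=dist.get))           # most probable next word
    return tokens

def lmote(suggestions, lm_next_word, n_needed, n=5, eta=100):
    """generate synthetic suggestions until the classes are balanced."""
    seeds = top_ngrams(suggestions, n=n, eta=eta)
    synthetic = []
    while len(synthetic) < n_needed:
        synthetic.append(lmote_generate(lm_next_word, random.choice(seeds), n=n))
    return synthetic
```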
smote: synthetic minority over-sampling technique convolutional neural networks for sentence classification recurrent convolutional neural networks for text classification umap: uniform manifold approximation and projection for dimension reduction suggestion mining from text semeval-2019 task 9: suggestion mining from online reviews and forums. in: semeval@naacl-hlt deep contextualized word representations key: cord-020832-iavwkdpr authors: nguyen, dat quoc; zhai, zenan; yoshikawa, hiyori; fang, biaoyan; druckenbrodt, christian; thorne, camilo; hoessel, ralph; akhondi, saber a.; cohn, trevor; baldwin, timothy; verspoor, karin title: chemu: named entity recognition and event extraction of chemical reactions from patents date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_74 sha: doc_id: 20832 cord_uid: iavwkdpr we introduce a new evaluation lab named chemu (cheminformatics elsevier melbourne university), part of the 11th conference and labs of the evaluation forum (clef-2020). chemu involves two key information extraction tasks over chemical reactions from patents. task 1—named entity recognition—involves identifying chemical compounds as well as their types in context, i.e., to assign the label of a chemical compound according to the role which the compound plays within a chemical reaction. task 2—event extraction over chemical reactions—involves event trigger detection and argument recognition. we briefly present the motivations and goals of the chemu tasks, as well as resources and evaluation methodology. the chemical industry undoubtedly depends on the discovery of new chemical compounds. however, new chemical compounds are often initially disclosed in patent documents, and only a small fraction of these compounds are published in journals, usually taking an additional 1-3 years after the patent [13] . therefore, most chemical compounds are only available through patent documents [3] . in addition, chemical patent documents contain unique information, such as reactions, experimental conditions, mode of action, which is essential for the understanding of compound prior art, providing a means for novelty checking and validation as well as pointers for chemical research in both academia and industry [1, 2] . as the number of new chemical patent applications has been drastically increasing [11] , it is becoming crucial to develop natural language processing (nlp) approaches that enable automatic extraction of key information from the chemical patents [2] . in this paper, we propose a new evaluation lab (called chemu) focusing on information extraction over chemical reactions from patents. in particular, we will focus on two key information extraction tasks of chemical named entity recognition (ner) and chemical reaction event extraction. while previous related shared tasks focusing on chemicals or drugs such as chemdner [7] have also included chemical named entity recognition as a task, those have primarily focused on pubmed abstracts. the chemdner patents task [8] was limited to entity mentions and chemical entity passage detection, and only considered titles and abstracts of patents. for our chemu lab, we extend the existing corpora in several directions: first, we go beyond chemical ner to require labeling of the role of a chemical with respect to a reaction, and to consider complete chemical reactions in addition to entities. the chemu website is available at: http:// chemu.eng.unimelb.edu.au. what are the goals of this evaluation lab? 
our goals are: (1) to develop tasks that impact chemical research in both academia and industry, (2) to provide the community with a new dataset of chemical entities, enriched with relational links between chemical event triggers and arguments, and (3) to advance the state-of-the-art in information extraction over chemical patents. why is this lab needed? for evaluating information extraction developments in the scientific literature domain, there have been a large number of labs/shared tasks offered within previous i2b2/n2c2, semeval, bionlp, biocreative, trec and clef workshops. however, less attention has been paid to the chemical patent domain. in particular, there has previously been only one shared task on this domain, which is the chemdner patents task at the biocreative v workshop, involving detection of mentions of chemical compounds and genes/proteins in patent text [8] . information extraction approaches developed for the scientific literature domain may not apply directly to the chemical patent domain. this is because as legal documents, patents are written very differently as compared to scientific literature. when writing scientific papers, authors strive to make their words as clear and straightforward as possible, whereas patent authors often seek to protect their knowledge from being fully disclosed [15] . in tension with this is the need to claim broad scope for intellectual property reasons, and hence patents typically contain more details and are more exhaustive than scientific papers [9] . there are also a number of characteristics of patent texts that create challenges for nlp in this context. long sentences listing names of compounds are frequently used in chemical patents. the structure of sentences in patent claims is usually complex, and syntactic parsing in patents can be difficult [4] . a quantitative analysis from [16] showed that the average sentence length in a patent corpus is much longer than in general language use. that work also showed that the lexicon used in patents usually includes domain-specific and novel terms that are difficult to understand.
table 1. brief definitions of chemu chemical entity types, organised into chemical entity types, a reaction label introduced in the text, and reaction properties.
- reaction product: a product is a substance that is formed during a chemical reaction
- starting material: a substance that is consumed in the course of a chemical reaction providing atoms to products is considered as starting material
- reagent catalyst: a reagent is a compound added to a system to cause or help with a chemical reaction. compounds like catalysts, bases to remove protons or acids to add protons must be also annotated with this tag
- solvent: a solvent is a chemical entity that dissolves a solute resulting in a solution
- other compound: other chemical compounds that are not the products, starting materials, reagents, catalysts and solvents
- example label: a label associated with a reaction specification
- temperature: the temperature at which the reaction was carried out must be annotated with this tag
- time: the reaction time of the reaction
- yield percent: yield given in percent values
- yield other: yields provided in other units than %
the chemu lab will provide a new challenging set of tasks, in an area of significant pharmacological importance. the lab will focus attention on more complex analysis of chemical patents, provide strong baselines as well as providing a useful resource for future research. what are usage scenarios?
automatically identifying compounds which serve as the starting material or are a product of a chemical reaction would allow more targeted extraction of chemical information from patents and can improve the usefulness of patent resources. automatic extraction of chemical reaction events supports the construction of cheminformatics databases, capturing key information about chemicals and how they are produced, from the patent resources. the chemu lab at clef-2020 1 offers the two information extraction tasks of named entity recognition (task 1) and event extraction (task 2) over chemical reactions from patent documents. teams may participate in one or both tasks. in general, a chemical reaction is a process leading to the transformation of one set of chemical substances to another [10] . task 1 involves identifying chemical compounds and their specific types, i.e. to assign the label of a chemical compound according to the role which it plays within a chemical reaction. in addition to chemical compounds, this task also requires identification of the temperatures and reaction times at which the chemical reaction is carried out, as well as yields obtained for the final chemical product and the label of the reaction. this task involves both entity boundary prediction and entity label classification. we define 10 different entity type labels as shown in table 1 . see examples of those entity types in table 2 . as illustrated in figs. 1 and 2 , a chemical reaction leading to an end product often consists of a sequence of individual event steps. task 2 is to identify those steps which involve chemical entities recognized from task 1. unlike a conventional event extraction problem [6] which involves event trigger word detection, event typing and argument prediction, our task 2 requires identification of event trigger words (e.g. "added" and "stirred") which all have the same type of fig. 2 . brat visualization of a more complex event with the trigger word "added" involving five arguments. "event trigger", and then determination of the chemical entity arguments of these events. 2 when predicting event arguments, we adapt semantic argument role labels arg1 and argm from the proposition bank [12] to label the relations between the trigger words and the chemical entities: arg1 is used to label the relation between an event trigger word and a chemical compound. here, arg1 represents argument roles of being causally affected by another participant in the event [5] . argm represents adjunct roles with respect to an event, used to label the relation between a trigger word and a temperature, time or yield entity. an end-to-end process incorporating both task 1 and task 2 can be equivalently viewed as a relation extraction task which identifies 11 entity types including 10 types defined in table 1 plus "event trigger", and extracts relations between the "event trigger" entities and the remaining entities. data: for system development and evaluation, a new corpus of 1500 chemical reaction snippets will be provided for both tasks (an example of a chemical reaction snippet is shown in table 2 ). these snippets are sampled from 170 english document patents from the european patent office and the united states patent and trademark office. we will mark up every chemical compound or event trigger with both text spans and ids, and highlight relations and event arguments, as illustrated in figs. 1 and 2. we have begun preparing the corpus and will make available strong baselines for the tasks. 
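to give a feel for the annotation layout, the snippet below shows a purely hypothetical fragment in brat standoff notation together with a tiny parser. the entity labels follow table 1, but the example sentence, the character offsets and the encoding of event arguments as arg1/argm relations are illustrative assumptions and may differ from the released corpus.

```python
# hypothetical example text: "sodium triacetoxyborohydride was added at room temperature for 2 hours"
EXAMPLE_ANN = (
    "T1\tREAGENT_CATALYST 0 28\tsodium triacetoxyborohydride\n"
    "T2\tEVENT_TRIGGER 33 38\tadded\n"
    "T3\tTEMPERATURE 42 58\troom temperature\n"
    "T4\tTIME 63 70\t2 hours\n"
    "R1\tARG1 Arg1:T2 Arg2:T1\n"
    "R2\tARGM Arg1:T2 Arg2:T3\n"
    "R3\tARGM Arg1:T2 Arg2:T4\n"
)

def parse_standoff(ann_text):
    """split a brat .ann string into entity and relation records."""
    entities, relations = {}, []
    for line in ann_text.strip().splitlines():
        ann_id, body = line.split("\t", 1)
        if ann_id.startswith("T"):                    # entity: label, start, end, surface text
            label_span, surface = body.split("\t")
            label, start, end = label_span.split()
            entities[ann_id] = (label, int(start), int(end), surface)
        elif ann_id.startswith("R"):                  # relation between an event trigger and an argument
            label, arg1, arg2 = body.split()
            relations.append((label, arg1.split(":")[1], arg2.split(":")[1]))
    return entities, relations

entities, relations = parse_standoff(EXAMPLE_ANN)
for label, trigger, arg in relations:
    print(label, entities[trigger][3], "->", entities[arg][3])   # e.g. "ARG1 added -> sodium triacetoxyborohydride"
```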
initial publications related to the data and task 1 appear at the 2019 alta and bionlp workshops, respectively [18, 19] . the corpus will be split into 70%/10%/20% training/development/test. gold annotations for the training and development sets will be provided to task participants in the brat standoff format [14] during the development phase. the raw test set will be provided for final test phase. to support teams who are interested in task 2 only, a pre-trained chemical ner tagger is provided as a resource [19] . evaluation: for evaluation, precision, recall and f1 scores will be used, under both strict and relaxed span matching conditions. f1 will be the main metric for ranking the participating teams [17] . 3 in this paper, we have presented a brief description of the upcoming chemu lab at clef-2020. chemu will focus on two new tasks of named entity recognition and event extraction over chemical reactions from patents. we expect participants from both academia and industry. we will advertise our chemu lab via social media as well as nlp-related mailing lists. annotated chemical patent corpus: a gold standard for text mining automatic identification of relevant chemical compounds from patents patents: a unique source for scientific technical information in chemistry related industry? world pat improving automated patent claim parsing: dataset, system, and experiments semantic role labeling and argument structure. in: speech and language processing overview of bionlp'09 shared task on event extraction chemdner: the drugs and chemical names extraction challenge overview of the chemdner patents task current challenges in patent information retrieval, 1st edn glossary of terms used in physical organic chemistry (iupac recommendations 1994) making every sar point count: the development of chemistry connect for the large-scale integration of structure and bioactivity data the proposition bank: an annotated corpus of semantic roles managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents brat: a web-based tool for nlp-assisted text annotation patents and scientific papers: quite different concepts: the reward is found in giving quantifying the challenges in parsing patent claims annotating the biomedical literature for the human variome detecting chemical reactions in patents improving chemical named entity recognition in patents with contextualized word embeddings acknowledgments. this work is supported by an australian research council linkage project, lp160101469, and elsevier. we would like to thank estrid he, zubair afzal and mark sheehan for supporting this work, as well as the anonymous reviewers for their feedback. key: cord-020811-pacy48qx authors: muhammad, shamsuddeen hassan; brazdil, pavel; jorge, alípio title: incremental approach for automatic generation of domain-specific sentiment lexicon date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_81 sha: doc_id: 20811 cord_uid: pacy48qx sentiment lexicon plays a vital role in lexicon-based sentiment analysis. the lexicon-based method is often preferred because it leads to more explainable answers in comparison with many machine learning-based methods. but, semantic orientation of a word depends on its domain. hence, a general-purpose sentiment lexicon may gives sub-optimal performance compare with a domain-specific lexicon. however, it is challenging to manually generate a domain-specific sentiment lexicon for each domain. 
still, it is impractical to generate complete sentiment lexicon for a domain from a single corpus. to this end, we propose an approach to automatically generate a domain-specific sentiment lexicon using a vector model enriched by weights. importantly, we propose an incremental approach for updating an existing lexicon to either the same domain or different domain (domain-adaptation). finally, we discuss how to incorporate sentiment lexicons information in neural models (word embedding) for better performance. sentiment lexicon is a dictionary of a lexical item with the corresponding semantic orientation. recently, with the issue of growing concern about interpretable and explainable artificial intelligence, domains that require high explainability in sentiment analysis task (eg., health domain and financial domain), lexicon-based sentiment analysis approaches are often preferred over machine-learning-based approaches [12, 13] . however, sentiment lexicons are domain-dependent, a word may convey two different connotations in a different domain. for example, the word high may have a positive connotation in economics (e.g., he has a high salary), and negative connotation in medicine (e.g., he has a high blood pressure). therefore, general-purpose sentiment lexicon may not give the expected predictive accuracy across different domains. thus, a lexicon-based approach with domain-specific lexicons are used to achieve better performance [1, 4] . although research has been carried out on corpus-based approaches for automatic generation of a domain-specific lexicon [1, 4, 5, 7, 9, 10, 14] , existing approaches focused on creation of a lexicon from a single corpus [4] . afterwards, one cannot automatically update the lexicon with a new corpus. there are many reasons one would want to update an existing lexicon: (i) the existing lexicon may not contain sufficient number of sentiment-bearing words (i.e., it is limited) and it needs to be extended with a corpus from the same domain with a source corpus; (ii) the language may have evolved (new words and meaning changes) and it is necessary to update the existing lexicon with a new corpus. the new corpus may not be large to enable generation of a new lexicon from scratch. thus, it is better to update the existing lexicon with the new corpus; and (iii) we need to update an existing lexicon to another domain (domainadaptation) with a corpus from different domain with the source corpus. to this end, this work proposes an incremental approach for the automatic generation of a domain-specific sentiment lexicon. we aim to investigate an incremental technique for automatically generating domain-specific sentiment lexicon from a corpus. specifically, we aim to answer the following three research questions: can we automatically generate a sentiment lexicon from a corpus and improves the existing approaches? rq2: can we automatically update an existing sentiment lexicon given a new corpus from the same domain (i.e., to extend an existing lexicon to have more entries) or from a different domain (i.e., to adapt the existing lexicon to a new domain -domain adaptation)? rq3: how can we enrich the existing sentiment lexicons using information obtained from neural models (word embedding)? to the best of our knowledge, no one attempted to design an approach for automatic construction of a sentiment lexicon in an incremental fashion. 
but, incremental approaches are common in the area of data streaming [15] ; thus, our work could fill this gap and represent a novel contribution. the research plan is structured as follows: sect. 2.1 attempts to answer rq1, sect. 2.2 attempts to answer rq2, and sect. 2.3 attempts to answer rq3. sattam et al. [4] introduced a novel domain agnostic sentiment lexicon-generation approach from a review corpus annotated with star-ratings. we propose an extended approach that includes the use of weight vector. also, our approach includes verbs and nouns in the lexicon as studies show they contain sentiment [7, 11] . the process includes the following four steps: (i) gathering data annotated with star-ratings; (ii) pre-processing the data; (iii) obtaining wordtag rating distribution, as shown in fig. 1 from the corpus introduced in [16] ; and (iv) generation of sentiment value for each word-tag pair using the equation: where f r w−t represents the frequency of word-tag pair and w is a weight vector. if the result is positive, the word is categorize as positive, otherwise it is negative. this basic approach of sentiment lexicon generation forms the basis of the incremental approach proposes in sect. 2.2. we propose an incremental approach for sentiment lexicon expansion to either the same domain or different domain (domain-adaptation). to illustrate the approaches, assume we have a sentiment lexicon l i generated from a corpus c i (using the approach described in sect. 2.1). then, we receive a new batch of corpus c i+1 (of the same or different domain with c i ). the incremental approach aims to generate an updated sentiment lexicon l i+1 that would improve the accuracy of the lexicon l i . assume we receive c i+1 and we want to update l i . assume we have the distributions of all the words in the previous corpus (c i ) saved. a naive approach would involve generating distributions of all the words in the new batch (c i+1 ) without creating a new lexicon from it. such a distribution represents the so-called "sufficient statistics" [15] and we can construct lexicon from each set of the distributions. to update l i , the two sets of distributions (from c i and c i+1 ) are first merged and updated lexicon (l i+1 ) is generated using the approach described in sect. 2.1. however, this approach may be inefficient since we update all the words in the existing lexicon. an enhanced and more efficient approach aims to update only subset of the words in l i whose orientation may have changed. this approach use l i to predict the user's sentiment rating scores on the new labelled corpus c i+1 sentences. if the predicted rating scores are the same with the user's sentiment ratings, we can skip those sentences and only consider those sentences where the predicted rating is significantly different from the user's sentiment rating scores. we extract the words from these sentences (reviews), elaborate the corresponding distribution of sentiment values, merge the distribution with the corresponding subset in the l i and generate a new sentiment lexicon l i+1 . assume we receive c i+1 and we want to update l i to a new domain. firstly, we propose to detect if c i+1 and c i are from different domain. to do this, we generate the distribution of c i+1 and compare it with the distribution of c i . if the distributions of the two corpora differ significantly, it indicates a domain shift. alternatively, we can use l i to predict the user's sentiment rating scores on the new labelled corpus c i+1 sentences. 
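a minimal sketch of this naive merge-and-regenerate update is given below; it assumes that each word-tag entry stores its star-rating frequency counts as the sufficient statistics and that the sentiment value is a weighted sum of those counts, since the exact scoring equation and weight vector from sect. 2.1 are not reproduced in the text and the weights used here are purely illustrative.

```python
# minimal sketch of the naive incremental lexicon update (assumptions noted above)
from collections import Counter

ILLUSTRATIVE_WEIGHTS = {1: -2.0, 2: -1.0, 3: 0.0, 4: 1.0, 5: 2.0}   # star rating -> assumed weight

def sentiment_value(rating_counts, weights=ILLUSTRATIVE_WEIGHTS):
    """weighted sum of the word-tag rating frequencies."""
    return sum(weights[r] * c for r, c in rating_counts.items())

def build_lexicon(distributions, weights=ILLUSTRATIVE_WEIGHTS):
    """distributions: {(word, pos_tag): Counter({rating: frequency})}."""
    return {wt: ("positive" if sentiment_value(c, weights) > 0 else "negative")
            for wt, c in distributions.items()}

def naive_update(dist_old, dist_new, weights=ILLUSTRATIVE_WEIGHTS):
    """merge the stored distributions of C_i with those of the new batch C_{i+1}
    and regenerate the whole lexicon L_{i+1}."""
    merged = {wt: Counter(c) for wt, c in dist_old.items()}
    for wt, counts in dist_new.items():
        merged.setdefault(wt, Counter()).update(counts)
    return merged, build_lexicon(merged, weights)
```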
if the prediction accuracy is below some predefined threshold, we can conclude there is a domain shift. after detecting the domain shift, we merge the distribution using a similar approach discussed (in updating using the same corpus) and generate the lexicon. however, in this case, we give different weight to the two distributions by taking into consideration not only their size, but also recency. more recent batches will be given more weight than the previous ones. the idea of word embedding have been widely used for generation of sentiment lexicon because of their advantage for giving semantic representation of words [9] . if two words appear in similar contexts, they will have similar embedding. we propose to use word embedding in the following way. suppose we have seed words with their sentiment values, and we encounter some word, say wx, for which we do not have a sentiment value (sval) yet. but if we have its embedding, we can look for the most similar embedding in the embedding space and retrieve the corresponding word, wy, retrieve its sval and use it as a sval of wx. as reported in [11] , neural models performance can increase by including lexicon information. we aim to further study litreture and find how to exploit combination of an existing sentiment lexicon (more explainable) and neural models performance. we plan to evaluate our system and compare it with other five existing lexicons: sentiwords, splm, so-cal, bing liu's opinion lexicon, and sentiword-net [14] . the evaluation task will be on three sentiment analysis tasks (movie review, polarity of tweets and hotel review). in these comparisons we will compare (1) the precision of the predictions of sentiment values and (2) runtime to carry out updates of the lexicon. we seek suggestions on how our proposal can be improved. more importantly, discussion on how to exploit combination of word embedding with sentiment lexicon. we also welcome comments. cognitive-inspired domain adaptation of sentiment lexicons sentiment lexicon generation constructing automatic domainspecific sentiment lexicon using knn search via terms discrimination vectors automatic construction of domain-specific sentiment lexicons for polarity classification inducing domain-specific sentiment lexicons from unlabeled corpora determining the level of clients' dissatisfaction from their commentaries lexicon-based methods for sentiment analysis automatic domain adaptation outperforms manual domain adaptation for predicting financial outcomes word embeddings for sentiment analysis: a comprehensive empirical survey sentiment lexicon construction with representation learning based on hierarchical sentiment supervision lexicon information in neural sentiment analysis: a multi-task learning approach explainable sentiment analysis with applications in medicine explainable artificial intelligence: a survey an overview of sentiment analysis approaches knowledge discovery from data streams on the negativity of negation acknowledgement. this project was partially financed by the portuguese funding agency, fct -fundação para a ciência e a tecnologia, through national funds, and co-funded by the feder. 
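a minimal sketch of the embedding-based look-up proposed in sect. 2.3 above is shown below; it assumes word vectors are available as a simple word-to-array dictionary and copies the sentiment value of the nearest lexicon word by cosine similarity.

```python
# minimal sketch of the nearest-embedding sentiment-value look-up (assumptions noted above)
import numpy as np

def nearest_sval(word, embeddings, lexicon_svals):
    """lexicon_svals: {word: sentiment value} for words already in the lexicon."""
    if word in lexicon_svals:
        return lexicon_svals[word]
    if word not in embeddings:
        return None                                   # no embedding, no guess
    v = embeddings[word]
    best_word, best_sim = None, -1.0
    for w, sval in lexicon_svals.items():
        if w not in embeddings:
            continue
        u = embeddings[w]
        sim = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
        if sim > best_sim:
            best_word, best_sim = w, sim
    return lexicon_svals[best_word] if best_word is not None else None
```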
key: cord-020896-yrocw53j authors: agarwal, mansi; leekha, maitree; sawhney, ramit; ratn shah, rajiv; kumar yadav, rajesh; kumar vishwakarma, dinesh title: memis: multimodal emergency management information system date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_32 sha: doc_id: 20896 cord_uid: yrocw53j
m. agarwal and m. leekha contributed equally, and wish that they be regarded as joint first authors. rajiv ratn shah is partly supported by the infosys center for ai, iiit delhi.
the recent upsurge in the usage of social media and the multimedia data generated therein has attracted many researchers for analyzing and decoding the information to automate decision-making in several fields. this work focuses on one such application: disaster management in times of crises and calamities. the existing research on disaster damage analysis has primarily taken only unimodal information in the form of text or image into account. these unimodal systems, although useful, fail to model the relationship between the various modalities. different modalities often present supporting facts about the task, and therefore, learning them together can enhance performance. we present memis, a system that can be used in emergencies like disasters to identify and analyze the damage indicated by user-generated multimodal social media posts, thereby helping the disaster management groups in making informed decisions. our leave-one-disaster-out experiments on a multimodal dataset suggest that not only does fusing information in different media forms improve performance, but that our system can also generalize well to new disaster categories. further qualitative analysis reveals that the system is responsive and computationally efficient. the amount of data generated every day is colossal [10] . it is produced in many different ways and many different media forms. analyzing and utilizing this data to drive the decision-making process in various fields intelligently has been the primary focus of the research community [22] . disaster response management is one such area. natural calamities occur frequently, and in times of such crisis, if the large amount of data being generated across different platforms is harnessed
[1] , which is the first publicly available dataset of its kind, we present the case for a novel multimodal system, and through our results report its efficiency, effectiveness, and generalizability. in this section, we briefly discuss the disaster detection techniques of the current literature, along with their strengths and weaknesses. we also highlight how our approach overcomes the issues present in the existing ones, thereby emphasizing the effectiveness of our system for disaster management. chaudhuri et al. [7] examined the images from earthquake-hit urban environments by employing a simple cnn architecture. however, recent research has revealed that often fine-tuning pre-trained architectures for downstream tasks outperform simpler models trained from scratch [18] . we build on this by employing transfer learning with several successful models from the imagenet [9] , and observed significant improvements in the performance of our disaster detection and analysis models, in comparison to a simple cnn model. sreenivasulu et al. [24] investigated microblog text messages for identifying those which were informative, and therefore, could be used for further damage assessment. they employed a convolutional neural network (cnn) for modeling the text classification problem, using the dataset curated by alam et al. [1] . extending on their work on crisismmd, we experimented with several other state-of-the-art architectures and observed that adding recurrent layers improved the text modeling. although researchers in the past have designed and experimented with unimodal disaster assessment systems [2, 3] , realizing that multimodal systems may outperform unimodal frameworks [16] , the focus has now shifted to leveraging information in different media forms for disaster management [20] . in addition to using several different media forms and feature extraction techniques, several researchers have also employed various methods to combine the information obtained from these modalities, to make a final decision [19] . yang et al. [28] developed a multimodal system-madis which leverages both text and image modalities, using hand-crafted features such as tf-idf vectors, and low-level color features. although their contribution was a step towards advancing damage assessment systems, the features used were relatively simple and weak, as opposed to the deep neural network models, where each layer captures complex information about the modality [17] . therefore, we utilize the latent representation of text and image modalities, extracted from their respective deep learning models, as features to our system. another characteristic that is essential for a damage assessment system is generalizability. however, most of the work carried out so far did not discuss this practical perspective. furthermore, to the best of our knowledge, so far no work has been done on developing an end-to-end multimodal damage identification and assessment system. to this end, we propose memis, a multimodal system capable of extracting information from social media, and employs both images and text for identifying damage and its severity in real-time (refer sect. 3). through extensive quantitative experimentation in the leave-one-disaster-out training setting and qualitative analysis, we report the system's efficiency, effectiveness, and generalizability. our results show how combining features from different modalities improves the system's performance over unimodal frameworks. 
in this section, we describe the different modules of our proposed system in greater detail. the architecture for the system is shown in fig. 1 . the internal methodological details of the individual modules are in the next section. the tweet streaming module uses the twitter streaming api 1 to scrap realtime tweets. as input to the api, the user can enter filtering rules based on the available information like hashtags, keywords, phrases, and location. the module outputs all the tweets that match these defined cases as soon as they are live on social media. multiple rules can be defined to extract tweets for several disasters at the same time. data from any social media platform can be used as input to the proposed framework. however, in this work, we consume disaster-related posts on twitter. furthermore, although the proposed system is explicitly for multimodal tweets having both images and text, we let the streaming module filter both unimodal and multimodal disaster tweets. we discuss in sect. 5.5 how our pipeline can be generalized to process unimodal tweets as well, making it more robust. a large proportion of the tweets obtained using the streaming module may be retweets that have already been processed by the system. therefore, to avoid overheads, we maintain a list of identifiers (ids) of all tweets that have been processed by the system. in case an incoming tweet is a retweet that has already been processed by the system before, we discard it. furthermore, some tweets may also have location or geographic information. this information is also stored to maintain a list of places where relief groups are already providing services currently. if a streamed geo-tagged tweet is from a location where the relief groups are already providing aid, the tweet is not processed further. a substantial number of tweets streamed from the social media platforms are likely to be irrelevant for disaster response and management. furthermore, different relief groups have varying criteria for what is relevant to them for responding to the situation. for instance, a particular relief group could be interested only in reaching out to the injured victims, while another provides resources for infrastructural damages. therefore, for them to make proper use of information from social media platforms, the relevant information must be filtered. we propose two sub-modules for filtering: (i) the first filters the informative tweets, i.e., the tweets that provide information relevant to a disaster, which could be useful to a relief group, (ii) the second filter is specific to the relief group, based on the type of damage response they provide. to demonstrate the system, in this work, we filter tweets that indicate infrastructural damage or physical damage in buildings and other structures. finally, once the relevant tweets have been filtered, we analyze them for the severity of the damage indicated. the system categorizes the severity of infrastructural damage into three levels: high, medium and low. based on the damage severity assessment by the system, the relief group can provide resources and services to a particular location. this information must further be updated in the database storing the information about all the places where the group is providing aid currently. furthermore, although not shown in the system diagram, we must also remove a location from the database once the relief group's activity is over, and it is no longer actively providing service there. 
this ensures that if there is an incoming request from that location after it was removed from the database, it can be entertained. in this section, we discuss the implementation details of the two main modules of the system for relevance filtering and severity analysis. we begin by describing the data pre-processing required for the multimodal tweets, followed by the deep learning-based models that we use for the modules. image pre-processing: the images are resized to 299 × 299 for the transfer learning model [29] and then normalized in the range [0, 1] across all channels (rgb). text pre-processing: all http urls, retweet headers of the form rt, punctuation marks, and twitter user handles specified as @username are removed. the tweets are then lemmatized and transformed into a stream of tokens that can be fed as input to the models used in the downstream modules. these tokens act as indices to an embedding matrix, which stores the vector representation for tokens corresponding to all the words maintained in the vocabulary. in this work, we use 100 dimensional fasttext word-embeddings [6] , trained on the cri-sismmd dataset [1] that has been used in this work. the system as a whole, however, is independent of the choice of vector representation. for the proposed pipeline, we use recurrent convolutional neural network (rcnn) [14] as the text classification model. it adds a recurrent structure to the convolutional block, thereby capturing contextual information with long term dependencies and the phrases which play a vital role at the same time. furthermore, we use the inception-v3 model [25] , pre-trained on the imagenet dataset [9] for modelling the image modality. the same underlying architectures, for both text and image respectively, are used to filter the tweets that convey useful information regarding the presence of infrastructural damage in the relevance filtering modules, and the analysis of damage in the severity analysis module. therefore, we effectively have three models for each modality: first for filtering the informative tweets, then for those pertaining to the infrastructural damage (or any other category related to the relief group), and finally for assessing the severity of damage present. in this subsection, we describe how we combine the unimodal predictions from the text and image models for different modules. we also discuss in each case about how the system would treat a unimodal text or image only input tweet. gated approach for relevance filtering. for the two modules within relevance filtering, we use a simplistic approach of combining the outputs from the text and image models by using the or function (⊕). technically speaking, we conclude that the combined output is positive if at least one of the unimodal models predicts so. therefore, if a tweet is predicted as informative by either the text, or the image, or both the models, the system predicts the tweet as informative, and it is considered for further processing in the pipeline. similarly, if at least one of the text and the image modality predicts an informative tweet as containing infrastructural damage, the tweet undergoes severity analysis. this simple technique helps avoid missing any tweet that might have even the slightest hint of damage, in either or both the modalities. any false positive can also be easily handled in this approach. 
if, say, a non-informative tweet is predicted as informative in the first step of relevance filtering, it might still be the case that in the second step the tweet is predicted as not containing any infrastructural damage. furthermore, in case a tweet is unimodal and has just the text or the image, the system can take the default prediction of the missing modality as negative (or false for a boolean or function), which is the identity for the or operation. in that case, the prediction based on the available modality will guide the analysis (fig. 2). attention fusion for severity analysis. the availability of data from different media sources has encouraged researchers to explore and leverage the potential boost in performance obtained by combining unimodal classifiers trained on individual modalities [5, 27]. here, we use attention fusion to combine the feature representations from the text and image modalities for the severity analysis module [12, 26]. the idea of attention fusion is to attend to some input features more than others while predicting the output class. the features, i.e., the outputs of the penultimate layer (the layer before the softmax) of the text and image models, are concatenated. this is followed by a softmax layer that learns the attention weights for each feature dimension, i.e., the attention weight α_i for a feature x_i is given by α_i = exp((w x)_i) / Σ_j exp((w x)_j), and the input feature after applying the attention weights is x̃_i = α_i · x_i, where i, j ∈ {1, 2, . . . , p}, p is the total number of dimensions in the multimodal concatenated feature vector, and w is the weight matrix learned by the model. this vector of attended features is then used to classify the given multimodal input. with this type of fusion, we can also analyze how the different modalities interact with each other by examining their attention weights. moving from the relevance filtering to the severity analysis module, we strengthen our fusion technique by using an attention mechanism. this is required since human resources are almost always scarce, and it is necessary to correctly assess the requirements at different locations based on the severity of the damage. as opposed to an or function, using attention we are able to combine the most important information as seen by the different modalities to jointly analyze the damage severity. in this case, the treatment of unimodal tweets is not as straightforward, since the final prediction using attention fusion occurs after concatenation of the latent feature vectors of the individual modalities. therefore, in case the text or image is missing, we use the unimodal model for the available modality. in other words, we use the attention mechanism only when both modalities are present to analyze damage severity; otherwise we use the unimodal models. recently, several datasets on crisis damage analysis have been released to foster research in the area [21]. in this work, we have used the first multimodal, labeled, publicly available twitter dataset related to damage, crisismmd, created by alam et al. [1]. it was collected by crawling the posts shared by users during seven natural disasters, which can be grouped into 4 disaster categories, namely floods, hurricanes, wildfires and earthquakes. crisismmd introduces three hierarchical tasks: 1. informativeness. this initial task classifies each multimodal post as informative or non-informative. alam et al. define a multimodal post as informative if it is useful in identifying areas where damage has occurred due to the disaster.
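a minimal pytorch-style sketch of the attention fusion described above, assuming the unimodal feature vectors have already been extracted; the layer sizes, the single linear layer producing the attention scores, and the four-class output (high, medium, low, no-damage) are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """concatenate text and image features, learn per-dimension attention
    weights with a softmax, and classify the attended feature vector."""
    def __init__(self, text_dim, image_dim, num_classes=4):
        super().__init__()
        p = text_dim + image_dim               # total fused dimensionality
        self.attn = nn.Linear(p, p)            # produces one score per dimension
        self.classifier = nn.Linear(p, num_classes)

    def forward(self, text_feat, image_feat):
        x = torch.cat([text_feat, image_feat], dim=-1)   # (batch, p)
        alpha = torch.softmax(self.attn(x), dim=-1)      # attention weights
        attended = alpha * x                             # re-weighted features
        return self.classifier(attended)

# usage with random tensors standing in for the unimodal models' penultimate layers
fusion = AttentionFusion(text_dim=128, image_dim=2048)
logits = fusion(torch.randn(8, 128), torch.randn(8, 2048))   # (8, 4) class scores
```

inspecting alpha in such a sketch also gives the per-dimension view of how the two modalities interact that is exploited in the qualitative analysis later in the paper.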
the informativeness task is therefore a binary classification problem, with the two classes being informative and non-informative. 2. infrastructural damage. the damage in an informative tweet may be of many different kinds [1, 4]. crisismmd identifies several categories for the type of damage, namely infrastructure and utility damage, vehicle damage, affected individuals, missing or found people, other relevant information, and none. alam et al. [1] also noted that the tweets which signify physical damage to structures, where people could be stuck, are especially beneficial for the rescue operation groups to provide aid. out of the above-listed categories, the tweets having infrastructure and utility damage are therefore identified in this task. this again is modelled as a classification problem with two classes: infrastructural and non-infrastructural damage. 3. damage severity analysis. this final task uses the text and image modalities together to analyze the severity of infrastructural damage in a tweet as high, medium, or low. we add another label, no-damage, to support the pipeline framework so that it can handle false positives as well. specifically, if a tweet having no infrastructural damage is predicted as positive, it can be detected here as having no damage. this is modelled as a multi-class classification problem. the individual modules of the proposed pipeline essentially model the above three tasks of crisismmd. specifically, the two relevance filtering modules model the first and the second tasks, respectively, whereas the severity analysis module models the third task (table 1). to evaluate how well our system can generalize to new disaster categories, we train our models for all three tasks in a leave-one-disaster-out (lodo) training paradigm. that is, we train on 3 disaster categories and evaluate the performance on the left-out disaster. to handle the class imbalance, we also used smote [8] with the word embeddings of the training fold samples for the linguistic baselines. we used the adam optimizer with an initial learning rate of 0.001, the values of β1 and β2 set to 0.9 and 0.999, respectively, and a batch size of 64 to train our models. we use the f1-score as the metric to compare model performance. all the models were trained on a geforce gtx 1080 ti gpu with a memory speed of 11 gbps. to demonstrate the effectiveness of the proposed system for multimodal damage assessment on social media, we perform an ablation study, the results of which are described below. design choices. we tried different statistical and deep learning techniques for modelling the text: tf-idf features with svm, naive bayes (nb) and logistic regression (lr) in the former category, and cnn [13], hierarchical attention model (hattn), bidirectional lstm (bilstm) and rcnn [14] in the latter. as input to the deep learning models, we use 100-dimensional fasttext word embeddings [6] trained on the dataset. by operating at the character n-gram level, fasttext tends to capture morphological structure well, helping otherwise out-of-vocabulary words (such as hashtags) share semantically similar embeddings with their component words. as shown in table 2, the rcnn model performed the best on all three tasks of the relevance filtering and severity analysis modules. specifically, the average lodo f1-scores of rcnn on the three tasks are 0.82, 0.76, and 0.79, respectively. furthermore, the architecture considerably reduces the effect of noise in social media posts [14].
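for concreteness, a compact sketch of a recurrent convolutional text classifier in the spirit of the rcnn of [14]: a bidirectional recurrent layer supplies left and right context, which is concatenated with the word embedding, projected, and max-pooled over time; the use of an lstm and the layer sizes are our assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

class RCNNClassifier(nn.Module):
    """rcnn-style text classifier: recurrent context + embedding,
    projection, max-pooling over time, and a final linear layer."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(embed_dim + 2 * hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                 # (batch, seq_len)
        emb = self.embedding(token_ids)           # (batch, seq_len, embed_dim)
        ctx, _ = self.rnn(emb)                    # (batch, seq_len, 2*hidden_dim)
        rep = torch.tanh(self.proj(torch.cat([emb, ctx], dim=-1)))
        pooled, _ = rep.max(dim=1)                # max-pool over time
        return self.out(pooled)                   # class logits

model = RCNNClassifier(vocab_size=20000)
logits = model(torch.randint(0, 20000, (8, 30)))  # batch of 8 token sequences
```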
for images, we fine-tuned the vgg-16 [23], resnet-50 [11] and inception-v3 [25] models, pre-trained on the imagenet dataset [9]. we also trained a cnn model from scratch. experimental results in table 2 reveal that inception-v3 performed the best, and the average f1-scores with lodo training for the three tasks are 0.74, 0.77, and 0.79, respectively. the architecture employs filters of multiple sizes to obtain a wide rather than a deep architecture, as very deep networks are prone to over-fitting. such a design makes the network computationally less expensive, which is a prime concern for our system as we want to minimize latency to give quick service to the disaster relief groups. table 3 highlights the results of an ablation study over the best linguistic and vision models, along with the results obtained when the predictions of these individual models are combined as discussed in sect. 4.3. the results for all the modules demonstrate the effectiveness of multimodal damage assessment models. specifically, we observe that for each disaster category in the lodo training paradigm, the f1-score for the multimodal model is always better than, or comparable to, those of the text and image unimodal models. in this section, we analyze some specific samples to understand the shortcomings of using unimodal systems, and to demonstrate the effectiveness of our proposed multimodal system. table 4 records these sample tweets along with their predictions as given by the different modules. correct predictions are shown in green, whereas incorrect ones are shown in red. they are discussed below in order: 1. the image in the first sample portrays the city landscape from the top, damaged by the calamity. due to the visual noise, the image does not give much information about the intensity of damage present, and therefore the image model incorrectly predicts the tweet as mildly damaged. on the other hand, the text model can identify the severe damage indicated by phrases like 'hit hard'. combining the two predictions using attention fusion therefore helps in overcoming the unimodal misclassifications. 2. in this tweet, the text uses several keywords, such as 'damaged' and 'earthquake', which mislead the text model into predicting it as severely damaged. however, the image does not hold the same perspective. by combining the feature representations, attention fusion can correctly predict the tweet as having mild damage. 3. the given tweet is informative and is therefore considered for damage analysis. however, the text classifier, despite the presence of words like 'killed' and 'destroyed', incorrectly classifies it to the non-infrastructural damage class. the image classifier correctly identifies the presence of damage, and therefore the overall prediction for the tweet is infrastructural damage, which is correct. furthermore, both the text and image models are unable to identify the severity of the damage present, but the proposed system can detect the presence of severe damage using attention fusion. the sample shows how the severity analysis module combines the text and visual cues by identifying and attending to the more important features. this helps in modelling the dependency between the two modalities, even when both individually give incorrect predictions. 4. the image in the tweet shows some hurricane-destroyed structures, depicting severe damage. however, the text talks about 'raising funds and rebuilding', which does not indicate severe damage.
the multimodal system learns to attend the text features more and correctly classifies the sample as having no damage, even though both the individual models predicted incorrectly. furthermore, in this particular example, even by using the or function, the system could not correctly classify it as not having infrastructural damage. yet, the damage severity analysis module identifies this false positive and correctly classifies it. in this section, we discuss some of the practical and deployment aspects of our system, as well as some of its limitations. we simulate an experiment to analyze the computational efficiency of the individual modules in terms of the time they take to process a tweet, i.e., the latency. we are particularly interested in analyzing the relevance filtering and severity analysis modules. we developed a simulator program to act as the tweet streaming module that publishes tweets at different load rates (number of tweets in 1 second) to be processed by the downstream modules. the modules also process the incoming tweets at the same rate. we calculate the average time for processing a tweet by a particular module as the total processing time divided by the total number of tweets used in the experiment. we used 15, 000 multimodal tweets from crisismmd, streamed at varying rates. the performance of the two relevance filtering modules and the severity analysis module as we gradually increase the load rate is shown in the fig. 3 . as a whole, including all the modules, we observed that on an average, the system can process 80 tweets in 1 minute. this experiment was done using an intel i7-8550u cpu having 16 gb ram. one can expect to see an improvement if a gpu is used over a cpu. generalization. the proposed system is also general and robust, especially in three aspects. firstly, the results of our lodo experiments indicate that the system can perform well in case it is used for analyzing new disasters, which were not used for training the system. this makes it suitable for real-world deployment where circumstance with new disaster categories cannot be foreseen. furthermore, we also saw how the two main modules of the system work seamlessly, even when one of the modalities is missing. this ensures that the system can utilize all the information that is available on the media platforms to analyze the disaster. finally, the second module in relevance filtering can be trained to suit the needs of several relief groups that target different types of damage, and therefore, the system is capable of being utilized for many different response activities. limitations. although the proposed system is robust and efficient, some limitations must be considered before it can be used in real-time. firstly, the system is contingent on the credibility i.e., the veracity of the content shared by users on social media platforms. it may so happen that false information is spread by some users to create panic amongst others [15] . in this work, we have not evaluated the content for veracity, and therefore, it will not be able to differentiate such false news media. another aspect that is also critical to all systems that utilize data generated on social media is the socio-economic and geographic bias. specifically, the system will only be able to get information about the areas where people have access to social media, mostly the urban cities, whereas damage in the rural locations may go unnoticed since it did not appear on twitter or any other platform. 
one way to overcome this is to make use of aerial images, that can provide a top view of such locations as the rural lands. however, this again has a drawback as to utilize aerial images effectively, a bulk load of data would have to be gathered and processed. identifying damage and human casualties in real-time from social media posts is critical to providing prompt and suitable resources and medical attention, to save as many lives as possible. with millions of social media users continuously posting content, an opportunity is present to utilize this data and learn a damage recognition system. in this work, we propose memis, a novel multimodal emergency management information system for identifying and analyzing the level of damage severity in social media posts with the scope for betterment in disaster management and planning. the system leverages both textual and visual cues to automate the process of damage identification and assessment from social media data. our results show how the proposed multimodal system outperforms the state-of-the-art unimodal frameworks. we also report the system's responsiveness through extensive system analysis. the leave-one-disaster-out training setting proves the system is generic and can be deployed for any new unseen disaster. crisismmd: multimodal twitter datasets from natural disasters processing social media images by combining human and machine computing during crises crisisdps: crisis data processing services a twitter tale of three hurricanes: harvey, irma, and maria. arxiv multimodal vehicle detection: fusing 3d-lidar and color camera data enriching word vectors with subword information application of image analytics for disaster response in smart cities smote: synthetic minority over-sampling technique imagenet: a large-scale hierarchical image database how much data do we create every day? the mind-blowing stats everyone should read deep residual learning for image recognition an attention-based decision fusion scheme for multimedia information retrieval convolutional neural networks for sentence classification recurrent convolutional neural networks for text classification from chirps to whistles: discovering eventspecific informative content from twitter damage identification in social media posts using multimodal deep learning handcrafted vs. non-handcrafted features for computer vision classification a survey on transfer learning multimodal deep learning based on multiple correspondence analysis for disaster management a computationally efficient multimodal classification approach of disaster-related twitter images natural disasters detection in social media and satellite imagery: a survey multimodal analysis of user-generated multimedia content very deep convolutional networks for large-scale image recognition detecting informative tweets during disaster using deep neural networks rethinking the inception architecture for computer vision attention is all you need multimodal fusion of eeg and fmri for epilepsy detection madis: a multimedia-aided disaster information integration system for emergency management how transferable are features in deep neural networks? 
in: advances in neural information processing systems key: cord-020936-k1upc1xu authors: sanz-cruzado, javier; macdonald, craig; ounis, iadh; castells, pablo title: axiomatic analysis of contact recommendation methods in social networks: an ir perspective date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_12 sha: doc_id: 20936 cord_uid: k1upc1xu contact recommendation is an important functionality in many social network scenarios including twitter and facebook, since they can help grow the social networks of users by suggesting, to a given user, people they might wish to follow. recently, it has been shown that classical information retrieval (ir) weighting models – such as bm25 – can be adapted to effectively recommend new social contacts to a given user. however, the exact properties that make such adapted contact recommendation models effective at the task are as yet unknown. in this paper, inspired by new advances in the axiomatic theory of ir, we study the existing ir axioms for the contact recommendation task. our theoretical analysis and empirical findings show that while the classical axioms related to term frequencies and term discrimination seem to have a positive impact on the recommendation effectiveness, those related to length normalization tend to be not desirable for the task. with the large-scale growth of social network platforms such as twitter or facebook, recommender systems technology that targets explicit social scenarios has seen a surge of interest [32, 37] . as part of this trend, the adaptation of information retrieval (ir) approaches to recommend people to connect to in the network have been particularly studied [17, 34] . this specific class of recommender systems has the interesting property that users play a dual role: they are the users to whom we want to provide recommendations, but they are also the items we want to recommend [32] . recently, it has been shown that classical ir weighting models -such as bm25 -can not only be used, but are also effective and efficient for the contact recommendation task [34] . in fact, recommender systems have always had strong connections with textual information retrieval (ir), since both tasks can be considered as particular cases of information filtering [9] . these ties have been materialized in the design and development of recommendation approaches based on ir models [2, 10, 39] . content-based recommender systems [2] have been the most direct realization of such ties. however, we also note the collaborative filtering methods of [10, 39] , which employed the vector space model or query likelihood to their advantage. in this paper, we analyze the reasons behind the effectiveness of ir approaches for the task of recommending contacts in social networks, through an exploratory analysis of the importance and validity of the fundamental ir axioms [13] . we start our analysis by examining contact recommendation methods that directly adapt ir models [34] , as they provide a bridge between existing work on axiomatic analysis in ir models, and this new task. in particular, we empirically analyze whether satisfying the ir axioms leads to an increase in the performances of the algorithms. interestingly, we find that while this is generally true, the axioms related to length normalization negatively impact the contact recommendation performance, since they interfere with a key evolutionary principle in social networks, namely preferential attachment [8] . 
by identifying the set of properties that an ir model must (at least) follow to provide effective results, axiomatic thinking as developed by fang et al. [12] has permitted to guide the development of both sound and effective ir approaches by explaining, diagnosing and improving them. in their seminal work, fang et al. [12] proposed several heuristics (known as axioms) addressing different properties of the models such as the frequency of the query terms in the retrieved documents, the relative discrimination between query terms, or how a model deals with long documents. they also analyzed the effect such properties had on the effectiveness of state-of-the-art models such as bm25 [29] or query likelihood [27] , and found that, with minor modifications to adhere to the different proposed axioms, the modified ir models achieved an improved retrieval performance. since the seminal work of fang et al., the original axioms have been refined and expanded [13, 35] , and other additional properties of effective ir models have been studied, such as the semantic relations between queries and documents [14] or term proximity [38] . recently, axiomatic analysis has been applied on neural ir models: rennings et al. [28] proposed a method for empirically checking if the learned neural models fulfil the different ir axioms, while rosset et al. [30] used the axioms as constraints for guiding the training of neural models. beyond ir, axiomatic analysis has also expanded to other areas such as recommender systems, where valcarce et al. [39, 40] explored the benefits of penalizing users who rate lots of items when selecting neighbors in user-based knn approaches. in this paper, using the ir-based contact recommendation framework proposed by sanz-cruzado and castells [34] as a basis, we map the ir axioms of fang et al. [13] into the task of recommending people in social networks, and empirically analyze how valid and meaningful each axiom is for this task. we first introduce the notations we use during the rest of the paper. given a social network, we represent its structure as a graph g = u, e , where u denotes the set of people in the network and e is the set of relationships between users. for each user u ∈ u, we denote by γ (u) the set of users with whom u has established relationships (the neighborhood of user u). in directed networks, three different neighborhoods can be considered depending on the link orientation: users who have a link towards u, γ in (u); users towards whom u has a link, γ out (u) ; and the union of both, γ und (u). we define γ inv (u) as the inverse neighborhood of u, i.e. the neighborhood u would have if the orientation of the links is reversed. weighted networks additionally include a function w : unweighted networks can be seen as a particular case where w : u 2 → {0, 1}. then, given a target user u, the contact recommendation task consists of suggesting a subset of usersγ out (u) ⊂ u \γ out (u) towards whom u has no links but who might be of interest for u. we define the recommendation task as a ranking problem, in which the result setγ out (u) is obtained and sorted by a ranking function f u : u \ γ out (u) → r. since we explore the importance of ir axioms for contact recommendation, we need to establish connections between both tasks. we take for this purpose the mapping proposed in [34] : we fold the three spaces in the ir task (documents, queries and terms) into a single space for people to people recommendation, namely the users in the network. 
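a small sketch of the graph primitives introduced above, using a plain dictionary-of-dictionaries adjacency structure and treating the network as undirected for simplicity; the helper names are ours, and a real implementation would keep separate incoming and outgoing adjacency maps to support the γ in, γ out and γ und orientations.

```python
# adj[u][v] = w(u, v) > 0 for every edge; for this sketch the graph is treated
# as undirected, so adj is kept symmetric and all neighborhoods coincide.
def neighbours(adj, u):
    return set(adj.get(u, {}))

def weighted_degree(adj, u):
    """sum of edge weights incident to u, a generalized notion of degree."""
    return sum(adj.get(u, {}).values())

def distance_two_candidates(adj, u):
    """users reachable through a common neighbour and not already linked to u."""
    cands = set()
    for t in neighbours(adj, u):
        cands |= neighbours(adj, t)
    return cands - neighbours(adj, u) - {u}

def common_neighbour_score(adj, u, v):
    """a simple ranking function f_u(v): sum of the weights that candidate v
    puts on the neighbours it shares with the target user u."""
    shared = neighbours(adj, u) & neighbours(adj, v)
    return sum(adj[v][t] for t in shared)
```

the code sketches later in this section reuse the same adjacency convention.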
in this mapping, we map queries and documents to the target and candidate users, respectively. we also use the neighbors of both the target and candidate users as the equivalent of the terms contained in the queries and documents. as proposed by sanz-cruzado and castells [34], we might use different neighborhoods to represent the target and candidate users (we could take either γ in, γ out or γ und for each of them). we denote by γ q (u) the neighborhood representing the target user, and by γ d (v) the one for the candidate user. the frequency of a term t in a document is represented in our mapping as an edge weight w d (v, t), defined from the network weights over the selected neighborhood γ d (v) by means of an indicator function 1 x, which is equal to one when the condition x is true, and 0 otherwise. in textual ir, the frequency is the basis for a measure of how important a term is for a document, and it is always positive. therefore, we assume that w d ≥ 0, and w d (v, t) = 0 if and only if t ∉ γ d (v). the higher the importance of the link (v, t), the higher the weight w d (v, t) should be. in our experiments (described in sect. 6), we use the number of interactions (i.e. retweets, mentions) between users as an example definition of w d (v, t). in those network datasets where this type of information is not available, we simply use binary weights. finally, the document length is mapped to the sum of the weights of the neighborhood of the candidate user: len(v) = Σ t∈γ l (v) w l (v, t), which can be seen as a generalized notion of vertex degree in the social graph. for some methods (such as bm25 [29]), we may consider a different neighborhood orientation when computing the user "size"; this explains the different symbols γ l , w l (not necessarily equal to γ d , w d ) in the definition of len(v). in this framework, as the ir models rely on common neighbors between the target and the candidate user, they can only recommend people at distance 2. table 1 summarizes the relation between the ir and contact recommendation tasks. further details about the mapping are described in [34]. before analyzing the importance of the ir axioms in the recommendation task, we first recall the ir axioms and reformulate them using the mapping from ir to contact recommendation. in the remainder of this section, we take the seven axioms proposed by fang et al. [13], divided into four categories, and analyze them. the first family of axioms analyzes the role of the frequency of the query terms in the retrieved documents. since term frequencies are represented as edge weights in our framework, we rename them as "edge weight constraints" (ewc) in our reformulation. the first constraint, tfc1, establishes that if the only difference between two documents is the frequency of a query term, then the document with the higher term frequency should be ranked above the other. the intuition behind this axiom is naturally translated to contact recommendation by considering the "common friends" principle in social bonding: all things being equal, you are more likely to connect to people who have stronger bonds to common friends. this principle can be expressed as follows: ewc1: if the target user u has a single neighbor γ q (u) = {t}, and we have two different candidate users v 1 , v 2 such that len(v 1 ) = len(v 2 ) and w d (v 1 , t) > w d (v 2 , t), then f u (v 1 ) > f u (v 2 ). the second term frequency constraint (tfc2) establishes that the ranking score increment produced by increasing term frequency should decrease with the frequency (i.e. ranking scores should have a dampened growth on term frequency, as in a diminishing returns pattern).
this also has a direct meaning in the contact recommendation space: the difference in scores between two candidate contacts should decrease with the weights of their common friends with the target user. formally, this constraint is expressed as: ewc2: for a target user u with a single neighbor γ q (u) = {t}, and three finally, the third axiom reflects the following property: occurrence frequencies and discriminative power being equal, the document that covers more distinct query terms should attain a higher score. in people recommendation, this translates to the triadic closure principle [25, 26] : all other things being equal, the more common friends a candidate contact has with the target user, the higher the chance that a new link between them exists. formally: . where td(t) is a measure of the informativeness of the common neighbors of the target and candidate users, as can be obtained from an idf measure. these three axioms are interdependent: if we take γ q (u) = {t} and we fix the values for td(t) and len(v), we could rewrite f u (v) as a function of the document is strictly subadditive. given a function g, g positive and concave ⇒ g is increasing and subadditive. therefore, for such functions (as is the case for most of the classic ir functions), ewc2 ⇒ ewc1 ∧ ewc3. however, if ewc2 is not satisfied, either ewc1 or ewc3 could still be satisfied. the term discrimination constraint is an axiom that formalizes the intuition that penalizing popular words in the collection (such as stopwords) and assigning higher weights to more discriminative query terms should produce better search results. this principle makes sense in contact recommendation: sharing a very popular and highly connected friend (e.g. two people following katy perry on twitter) may be a rather weak signal to infer that these two people would relate to each other. a less social common friend, however, may suggest the two people may indeed have more interests in common. this idea is in fact reflected in some contact recommendation algorithms such as adamic-adar [1, 22] . hence, we rename the axiom as "neighbor discrimination constraint" (ndc), and we adapt the version of the axiom proposed by shi et al. [35] , which simplifies the translation to our domain, as follows: the third family of ir axioms studies how algorithms should deal with the length of the documents. as defined in sect. 3, in our mapping, the length of the document is translated to the sum of the edge weights between the candidate user and its neighbors: len(v). as we only study the length of the candidate user, we will rename this family of constraints as "candidate length normalization constraints" (clnc). fang et al. [13] proposed two different lncs. the first axiom states that for two documents with the same query term occurrence frequency, we should choose the shorter one, since it contains the least amount of query-unrelated information. in contact recommendation, this means penalizing popular, highly connected candidate users with many neighbors not shared with the target user. we hence reformulate this axiom as: clnc1: given a target user u and two candidate users v 1 the second constraint aims to avoid over-penalizing long documents: it states that if a document is concatenated to itself multiple times, the resulting document should not get a lower score than the original. in contact recommendation, this means that, if we multiply all the edge weights of a candidate user by a positive number, the score for the candidate user should not decrease. 
formally: for all users x and some constant k > 1, and w d (v 1 , t) > 0 for some neighbor t ∈ γ q (u) of the target user u, then we have f u (v 1 ) ≥ f u (v 2 ). the last heuristic aims to provide a balance between query term frequency in documents and length normalization. the axiom states that if we add more occurrences of a query term to a document, its retrieval score should increase. for contact recommendation, the intuition is similar: if the link weight between two users v and t increases, then v's score as a candidate for target users having t in their neighborhood should increase. this axiom is then expressed as follows: ew-clnc: given a target user u with a single neighbor γ q (u) = {t}, if two candidates v 1 and v 2 are such that w d (v 1 , t) > w d (v 2 , t) and len(v 1 ) = len the first step to undertake an analysis of the ir axioms in contact recommendation is to determine the set of algorithms for which the different axioms are applicable, and, for those, to identify which constraints they satisfy and under which conditions. in this section, we provide an overview of different contact recommendation methods and their relation with the axioms. we divide the approaches into two groups: friends of friends approaches, which only recommend people at network distance 2 from the target user, and methods which might recommend more distant users. the first group includes all ir models, as well as other approaches such as the most common neighbors (mcn) and adamic-adar's approach [22] , whereas the second group includes matrix factorization [18, 21] , random walk-based methods [16, 41] and knn [2] . the proposed set of constraints is not applicable to the algorithms in the second group, since the constraints are based on the idea that the weighting functions depend on the common users between the target and the candidate users. therefore, in the rest of the article, we focus on the algorithms in the first family. as future work, we envisage the formulation of new constraints tailored for algorithms that recommend users at distance greater than 2, possibly as a generalization of the set of constraints we study in this paper (see e.g. the formal analysis of pseudo-relevance feedback by clinchant and gaussier [11] , which in our mapping would correspond to distance greater than 2). we start analyzing the friends of friends methods by studying the ir models. in the adaptation of these models by sanz-cruzado and castells [34] , the components of the ranking functions (frequency/weight, discriminative power functions, document/user length) maintain the basic properties on which the formal analysis by fang et al. [12, 13] has relied. therefore, the adapted methods satisfy the same constraints in the social network as those satisfied in the text ir space, and, if they are only satisfied under certain conditions, we can find the new conditions just by adapting them for the contact recommendation task. then, models like pl2 [3, 7] , the pivoted normalization vector space model (vsm) [36] query likelihood with dirichlet (qld) [42] or jelinek-mercer smoothing (qljm) [27] keep their original properties in this new space. we find however one point of difference related to a possibility considered by sanz-cruzado and castells in the definition of the candidate user length; namely, that we can define the length of the candidate users by selecting a different neighborhood γ l (v) than the one used for defining the candidate user, γ d (v), as explained in sect. 3. 
as the only difference between the original and the version of bm25 defined by sanz-cruzado and castells is just the definition of the candidate length, it is straightforward to prove that all edge weight constraints and ndc are satisfied in the same way as they are for textual ir: ndc is unconditionally true, whereas all ewc axioms depend just on the condition: which, in contact recommendation, is likely to be true -indeed, as of 2019, twitter has >300 m users, and, the most followed user has just 107 m followers. on the other hand, differences arise when we study the constraints involving length normalization: clncs and ew-clnc. if we keep the same orientation for the user length and neighborhood selection for the candidate user, the mapping maintains the same components as the original ranking function, and, consequently, the condition for satisfying the three axioms is the same as the original: satisfying condition c 1 . however, if the orientation for the length is changed, it is easy to show that, for clnc1, bm25 satisfies the axiom if both conditions c 1 and c 2 are true, or both are false, where: and, for the ew-clnc, the constraint is kept if conditions c 1 and c 3 are met, or none of them are, where: the only length normalization-related constraint that is satisfied under the same conditions as the original bm25 model is the clnc2 constraint, since it does not really depend on the definition of user length. table 3 shows the differences between the original version and this adaptation of the bm25 model for contact recommendation. hence, we introduce a new ir-based approach, namely the extreme bm25 (ebm25) method, a variant of bm25 where we make the k parameter tend to infinity. in comparison with bm25, all constraints are satisfied under the conditions specified for bm25, except ewc2 and ewc3, which are not satisfied at all for ebm25. in the bm25 model, under the conditions of ewc2, the k parameter establishes how f u (v) grows as a function of the weight of the only common neighbor between the target and candidate users. the greater the value of k, the more the growth function approximates a linear function. when k → ∞, the growth becomes linear, and as a consequence, the model does not meet the ewc2 constraint. a similar issue occurs with ewc3. beyond the ir models, other approaches such as adamic-adar or mcn do operate at distance 2. in the particular case of these methods, they consider neither weights nor any means of normalization; only ewc3 and clnc2 are applicable here: under the conditions of ewc3, both methods just measure the number of common neighbors, satisfying the constraint. for clnc2, if we multiply all the weights of the link for a candidate by any number k = 0, the score of the functions would not vary (and, consequently, they meet the axiom). we summarize this analysis in table 2 , where we identify whether a method satisfies (fully or conditionally) or not the different axioms. in the case of the models not described in this section (pivoted normalization vsm, pl2, qld), we refer to the article by fang et al. [13] for further information on the conditions to satisfy the axioms. next, we empirically analyze whether satisfying the axioms leads to an improvement of the performance of such algorithms. prior work on axiomatic thinking [12, 13] has analyzed to which extent the satisfaction of a suitable set of constraints correlates with effectiveness. 
this is also a mechanism to validate such constraints, showing that it is useful to predict, explain or diagnose why an ir system is working well or badly. taking up this perspective, we undertake next such an empirical analysis of constraints in the contact recommendation setting, using a set of friends-of-friends algorithms. data: we use different network samples from twitter and facebook: the ego-facebook network released in the stanford large network dataset collection [24] , and two twitter data downloads described in [34] as 1-month and 200-tweets. the twitter downloads include each two different sets of edges for the same set of users: the follow network (where (u, v) ∈ e if u follows v), and the interaction network (where (u, v) ∈ e if u retweeted or mentioned v). the datasets are described in more detail in [32] [33] [34] . for evaluation purposes, we partition each network into a training graph that is supplied as input to the recommendation algorithms, and a test graph that is held out for evaluation. using the test graph, ir metrics such as precision, recall or ndcg can be computed, as well as other accuracy metrics such as auc [15] , by considering test edges as binary relevance judgements: a user v is relevant to a user u if -and only if -the edge (u, v) appears in the test graph. we further divide the training graph into a smaller training graph and a validation graph for parameter tuning. table 4 shows the size of the different resulting subgraphs. for all twitter networks, temporal splits are applied: the training data includes edges created before a given time, and the test set includes links created afterwards. edges appearing in both sides of the split are removed from the test network. for the interaction network, two different temporal points are selected to generate the split: july 5 th and july 12 th in the 1-month dataset, and july 24 th and july 29 th in 200-tweets. weights for the training graphs were computed by counting the number of interactions before the splits. for the follow networks, the edges between the users of the interaction network were downloaded three times: the first download is used as training graph for parameter tuning; the new links in the second snapshot (not present in the initial one), downloaded four months later, are used as the validation set; the complete second snapshot is given as input to the recommendation algorithms under evaluation; finally, the new edges in the third download (not present in the second), obtained two years afterwards, are used as the test data for evaluation. for the facebook data, since temporal information is not available, we apply a simple random split: 80% of links are sampled as training and 20% as test; within the training data, we use 25% of the edges as the validation subset. we focus on contact recommendation approaches that recommend users at distance 2. from that set, as representative ir models, we include adaptations for the pivoted normalization vector space model [36] ; bir and bm25 [29] as probabilistic models based on the probability ranking principle; query likelihood [27] with jelinek-mercer [20] , dirichlet [23] and laplace [39] smoothing as language models; and pl2 [3, 7] , dfree, dfreeklim [6] , dph [4] and dlh [5] as divergence from randomness approaches. in addition, we include adaptations of a number of link prediction methods [22] (following [34] ): adamic-adar [1] , jaccard [19] , most common neighbors [22] and cosine similarity [31] . 
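to make the adapted weighting models concrete, a sketch of a bm25-style scorer under the mapping of sect. 3, reusing the undirected adjacency convention of the earlier sketch; the smoothing constants, the 0.5-corrected robertson–spärck-jones weight, and the choice of neighborhoods are our assumptions and may differ from the exact formulation of [34].

```python
import math

def bm25_score(adj, u, v, n_users, avg_len, k=1.2, b=0.75):
    """bm25-style ranking of candidate v for target u over their common
    neighbours: edge weights play the role of term frequencies and the
    weighted degree of v plays the role of document length."""
    nbr_u, nbr_v = set(adj.get(u, {})), set(adj.get(v, {}))
    len_v = sum(adj.get(v, {}).values())
    score = 0.0
    for t in nbr_u & nbr_v:
        freq = adj[v][t]                                   # w_d(v, t)
        df = len(adj.get(t, {}))                           # users neighbouring t
        rsj = math.log((n_users - df + 0.5) / (df + 0.5))  # neighbour discrimination
        norm = k * (1.0 - b + b * len_v / avg_len) + freq
        score += rsj * (k + 1.0) * freq / norm
    return score

# letting k grow without bound makes each contribution linear in freq,
# i.e. rsj * freq / (1 - b + b * len_v / avg_len): the ebm25 variant above.
```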
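in the same setting, minimal textbook formulations of the adapted link-prediction baselines listed above (most common neighbors, adamic-adar, jaccard and cosine); the adaptations actually evaluated may differ in detail, for instance in the neighborhood orientation used.

```python
import math

def _nbrs(adj, x):
    return set(adj.get(x, {}))

def mcn(adj, u, v):
    """most common neighbors."""
    return len(_nbrs(adj, u) & _nbrs(adj, v))

def adamic_adar(adj, u, v):
    """common neighbors discounted by their log-degree, penalizing hubs."""
    return sum(1.0 / math.log(len(_nbrs(adj, t)))
               for t in _nbrs(adj, u) & _nbrs(adj, v) if len(_nbrs(adj, t)) > 1)

def jaccard(adj, u, v):
    union = _nbrs(adj, u) | _nbrs(adj, v)
    return mcn(adj, u, v) / len(union) if union else 0.0

def cosine(adj, u, v):
    denom = math.sqrt(len(_nbrs(adj, u)) * len(_nbrs(adj, v)))
    return mcn(adj, u, v) / denom if denom else 0.0
```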
2 we start by analyzing the edge weight constraints. since weights are binary in the twitter follow graphs and facebook, we focus here on interaction graphs, where the interaction frequency provides a natural basis for edge weighting. a first natural question that arises when we study these axioms is whether the weights are useful or not for providing good recommendations. this is equivalent to test the importance of the first axiom for the contact recommendation task. to answer that question, we compare the two options (binarized vs. not binarized weights) in all algorithms which make use of weights: cosine similarity between users and all the ir models except bir. we show the results in fig. 1(a) , where each dot represents a different approach. in the x axis, we show the ndcg@10 value for the unweighted approaches, whereas the y axis shows ndcg@10 for the weighted ones. we can see that using weights results in an inferior performance in all algorithms except for bm25 and the simple cosine similarity. these observations suggest that ewc1 does not appear to be a reliable heuristic for contact recommendation in networks. however, once the weight is important for a model (and, therefore, ewc1 is important) does satisfying the rest of the edge weight constraints provide more accurate recommendations? to check that, similarly to fang et al. [12, 13] , we compare an algorithm that satisfies all three ewcs (and benefits from weights) with another one that does not satisfy ewc2 and ewc3: we compare bm25 vs. ebm25. fixing the k parameter for the bm25 model (using the optimal configuration from our experiments), we compare different parameter configurations for bm25 and ebm25. results are shown in fig. 1(b) , where every dot in the plot corresponds to a different model configuration, the x axis represents the ndcg@10 values for bm25, and the y axis those of the ebm25 model. as it can be observed, ebm25 does not improve over bm25 for almost every configuration (dots are all below the y = x plane), thus showing that, as long as ewc1 is important for the model, both ewc2 and ewc3 are relevant. as explained in sect. 4, ewc3 can also be satisfied independently of ewc1 and ewc2, so we finally check its importance. for that purpose, we address the following question: for any friends-of-friends algorithm, such as adamic-adar [1] or the ir models, is it beneficial to reward the number of common users between the target and the candidate users? to analyze this, we compare the mcn approach (which satisfies the constraint) with a binarized version of mcn which returns all people at distance 2 regardless of the common neighbor count. restricting the test set to people at distance 2, table 5 shows the resulting auc [15] of the mcn algorithm, averaged over users on each network. under these conditions, the binarized version would have an auc value of 0.5. hence, our results show that the number of common neighbors seem to be a strong signal for providing accurate recommendations (and, therefore, ewc3 seems to be important on its own for the contact recommendation task). neighbor discrimination constraint (ndc): as previously explained, this constraint suggests penalizing highly popular common neighbors. in ir approaches, this constraint is satisfied or not depending on the presence or absence of a term discrimination element (such as the robertson-spärck-jones in bm25/ebm25 or the p c (t) term in query likelihood approaches). 
therefore, to check the effectiveness benefit of this axiom, we compare -in terms of ndcg@10 -the bm25, ebm25, qld, qljm and the pivoted normalization vsm models with variants of them that lack term discrimination. figure 2 shows the difference between different variants of each model. in the figure, a positive value indicates that the original version (with term discrimination) performs better. we observe that in an overwhelming majority of points the original versions achieve a better accuracy, hence ndc appears to be key to providing good contact recommendations. this confirms the hypothesis in many recommendation approaches that using high-degree users to discriminate which users are recommended does not seem to be a good idea [1, 43] . finally, we study the effect of normalizing by candidate user length. for that purpose, similarly to the previous section, we compare the bm25, ebm25, qljm, qld and the pivoted normalization vsm models with versions of the models lacking the normalization by the candidate user length (which do not satisfy clnc1 and ew-clnc) using ndcg@10. we show a graph showing the differences in accuracy between different variants of the algorithms in fig. 3(a) . since there are few differences between datasets, we only show results for the interactions network of the twitter 1-month dataset. in the figure, we observe an opposite trend to what was expected: instead of performing worse, the algorithms without normalization do improve the results. therefore, it seems that the different length normalization constraints are not useful for contact recommendation. these observations are consistent with the preferential attachment phenomenon in social networks [8] , whereby high-degree users are more likely to receive new links than long-tail degree users. as an example, we check this in fig. 3(b) , where we compare the performances of the recommendation approaches listed in section 6.1 with the average in-degree, out-degree and (undirected) degree of the recommended people. we observe that, in general, in-degree and degree are clearly correlated with the performances of the methods, as the principle indicates. with out-degree this is not so clear though. this explains the few configurations in fig. 3 (a) that do not improve when we remove the normalization: all of them normalize by the sum of the weights of the outgoing links of the candidate users. similar trends are observed in other networks. we have theoretically and empirically analyzed the importance of the fundamental ir axioms for the contact recommendation task in social networks. theoretically, we have translated the different axioms proposed in [13] to the contact recommendation task, and we have checked whether the mapping introduced in [34] is sound and complete. we have found that, in general, the properties of the ir models are held in the recommendation task when we apply this mapping, unless we use a different definition for the document length from the usual. empirically, we have conducted several experiments over various twitter and facebook networks to check if those axioms have any positive effect on the accuracy of the recommenders. we showed that satisfying the constraints related to term frequencies and term discrimination have a positive impact on the accuracy. however, those related to length normalization tend to have the opposite effect, as they interfere with a basic evolutionary principle of social networks, namely preferential attachment [8] . 
friends and neighbors on the web toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions probability information models for retrieval based on divergence from randomness frequentist and bayesian approach to information retrieval fub, iasi-cnr and university of tor vergata at trec 2007 blog track fub, iasi-cnr, univaq at trec 2011 microblog track probabilistic models of information retrieval based on measuring the divergence from randomness emergence of scaling in random networks information filtering and information retrieval: two sides of the same coin? bridging memory-based collaborative filtering and text retrieval a theoretical analysis of pseudo-relevance feedback models a formal study of information retrieval heuristics diagnostic evaluation of information retrieval models semantic term matching in axiomatic approaches to information retrieval an introduction to roc analysis the whoto-follow system at twitter: strategy, algorithms, and revenue impact recommending twitter users to follow using content and collaborative filtering approaches collaborative filtering for implicit feedback datasets étude comparative de la distribution florale dans une portion des alpes et des jura interpolated estimation of markov source parameters from sparse data matrix factorization techniques for recommender systems the link-prediction problem for social networks a hierarchical dirichlet language model learning to discover social circles in ego networks clustering and preferential attachment in growing networks networks: an introduction, 1st edn a language modeling approach to information retrieval an axiomatic approach to diagnosing neural ir models the probabilistic relevance framework: bm25 and beyond an axiomatic approach to regularizing neural ranking models a vector space model for automatic indexing contact recommendations in social networks enhancing structural diversity in social networks by recommending weak ties information retrieval models for contact recommendation in social networks gravitation-based model for information retrieval proceedings of the 7th text retrieval conference (trec 1998) social recommendation: a review an exploration of proximity measures in information retrieval axiomatic analysis of language modelling of recommender systems finding and analysing good neighbourhoods to improve collaborative filtering. knowl.-based syst algorithms for estimating relative importance in networks a study of smoothing methods for language models applied to information retrieval predicting missing links via local information acknowledgements. j. sanz-cruzado and p. castells were partially supported by the spanish government (tin2016-80630-p). c. macdonald and i. ounis were partially supported by the european community's horizon 2020 programme, under grant agreement n ō 779747 entitled bigdatastack. key: cord-020793-kgje01qy authors: suominen, hanna; kelly, liadh; goeuriot, lorraine; krallinger, martin title: clef ehealth evaluation lab 2020 date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_76 sha: doc_id: 20793 cord_uid: kgje01qy laypeople’s increasing difficulties to retrieve and digest valid and relevant information in their preferred language to make health-centred decisions has motivated clef ehealth to organize yearly labs since 2012. 
these 20 evaluation tasks on information extraction (ie), management, and information retrieval (ir) in 2013–2019 have been popular—as demonstrated by the large number of team registrations, submissions, papers, their included authors, and citations (748, 177, 184, 741, and 1299, respectively, up to and including 2018)—and achieved statistically significant improvements in the processing quality. in 2020, clef ehealth is calling for participants to contribute to the following two tasks: the 2020 task 1 on ie focuses on term coding for clinical textual data in spanish. the terms considered are extracted from clinical case records and they are mapped onto the spanish version of the international classification of diseases, the 10th revision, including also textual evidence spans for the clinical codes. the 2020 task 2 is a novel extension of the most popular and established task in clef ehealth on chs. this ir task uses the representative web corpus used in the 2018 challenge, but now also spoken queries, as well as textual transcripts of these queries, are offered to the participants. the task is structured into a number of optional subtasks, covering ad-hoc search using the spoken queries, textual transcripts of the spoken queries, or provided automatic speech-to-text conversions of the spoken queries. in this paper we describe the evolution of clef ehealth and this year’s tasks. the substantial community interest in the tasks and their resources has led to clef ehealth maturing as a primary venue for all interdisciplinary actors of the ecosystem for producing, processing, and consuming electronic health information. substantial community interest in the tasks and their resources has led to clef ehealth maturing as a primary venue for all interdisciplinary actors of the ecosystem for producing, processing, and consuming electronic health information. keywords: ehealth · medical informatics · information extraction · information storage and retrieval · speech recognition improving the legibility of electronic health record (ehr) can contribute to patients' right to be informed about their health and health care. the requirement to ensure that patients can understand their own privacy-sensitive, official health information in their ehr are stipulated by policies and laws. for example, the declaration on the promotion of patients' rights in europe by world health organization (who) from 1994 obligates health care workers to communicate in a way appropriate to each patient's capacity for understanding and give each patient a legible written summary of these care guidelines. this patient education must capture the patient's health status, condition, diagnosis, and prognosis, together with the proposed and alternative treatment/non-treatment with risks, benefits, and progress. patients' better abilities to understand their own ehr empowers them to take part in the related health/care judgment, leading to their increased independence from health care providers, better health/care decisions, and decreased health care costs [11] . improving patients' ability to digest this content could mean enriching the ehr-text with hyperlinks to term definitions, paraphrasing, care guidelines, and further supportive information on patientfriendly and reliable websites, and the enabling methods for such reading aids can also release health care workers' time from ehr-writing to, for example, longer patient-education discussions [14] . 
information access conferences have organized evaluation labs on related electronic health (ehealth) information extraction (ie), information management (im), and information retrieval (ir) tasks for almost 20 years. yet, with rare exception, they have targeted the health care experts' information needs only [1, 2, 6] . such exception, the clef ehealth evaluation-lab and lab-workshop series 1 has been organized every year since 2012 as part of the conference and labs of the evaluation forum (clef) [4, 5, [8] [9] [10] 13, 16, 17] . in 2012, the inaugural scientific clef workshop took place, and from 2013-2019 this annual workshop has been supplemented with a lead-up evaluation lab, consisting of, on average, three shared tasks each year (fig. 1) . although the tasks have been centered around the patients and their families' needs in accessing and understanding ehealth information, also automatic speech recognition (asr) and ie to aid clinicians in im were considered in 2015-2016 and in 2017-2019, tasks on technology assisted reviews to support health scientists and health care policymakers' information access were organized. this paper presents first an overview of clef ehealth lab series from 2012 to 2019 and introduces its 2020 evaluation tasks. then, it concludes by presenting our vision for clef ehealth beyond 2020. clef ehealth tasks offered yearly from 2013 have brought together researchers working on related information access topics, provided them with resources to work with and validate their outcomes, and accelerated pathways from scientific ideas to societal impact. in 2013, 2014, 2015, 2016, 2017, 2018, and 2019 as many as 170, 220, 100, 116, 67, 70, and 67 teams have registered their expression of interest in the clef ehealth tasks, respectively, and the number of teams proceeding to the task submission stage has been 53, 24, 20, 20, 32, 28, and 9, respectively [4, 5, [8] [9] [10] 16, 17] . 2 according to our analysis of the impact of clef ehealth labs up to 2017 [15] , the submitting teams have achieved statistically significant improvements in the processing quality in at least 1 out of the top-3 methods submitted to the following eight tasks: 3 [14] . clef ehealth 2012 lab workshop has resulted in 16 papers and each year clef ehealth 2013-2017 evaluation labs have increased this number from 31 to 35. in accordance with the clef ehealth mission to foster teamwork, the number of co-authors per paper has been from 1 to 15 (the mean and standard deviation of 4 and 3, respectively). in about a quarter of the papers, this co-authoring collaboration has been international, and sometimes even intercontinental. this substantial community interest in the clef ehealth tasks and their resources has led to the evaluation campaign maturing and establishing its presence over the years. in 2020, clef ehealth is one of the primary venues for all interdisciplinary actors of the ecosystem for producing, processing, and consuming ehealth information [1, 2, 6] . its niche is addressing health information needs of laypeople-and not health care experts only-in retrieving and digesting valid and relevant ehealth information to make health-centered decisions. the 2020 clef ehealth task 1 on ie, called codiesp supported by the spanish national plan for the advancement of language technology (plan tl), builds upon the five previous editions of the task in 2015-2019 [4, 5, 8, 10, 16] that have already addressed the analysis of biomedical text in english, french, hungarian, italian, and german. 
this year, the codiesp task will focus on international classification of diseases, 10th revision (icd10) coding for clinical case data in spanish, using the spanish version of icd10 (cie10). the codiesp task will explore the automatic assignment of cie10 codes of two categories, procedure and diagnosis (known as 'procedimiento' and 'diagnostico' in spanish), to clinical case documents in spanish. the following three subtasks will be posed: (1) codiesp diagnosis coding will consist of automatically assigning diagnosis codes to clinical cases in spanish. (2) codiesp procedure coding will focus on assigning procedure codes to clinical cases in spanish. (3) codiesp explainable artificial intelligence (ai) will evaluate the explainability/interpretability of the proposed systems, as well as their performance, by requesting them to return the text spans supporting the assignment of cie10 codes.

the codiesp corpus used for this task consists of a total of 1,000 clinical cases that were manually annotated by clinical coding professionals with clinical procedure and diagnosis codes from the spanish version of icd10, together with the actual minimal text spans supporting the clinical codes. the codiesp corpus has around 18,000 sentences, and contains about 411,000 words and 19,000 clinical codes. code annotations will be released in a separate file together with the respective document code and the span of text that leads to the codification (the evidence). additional data resources, including medical literature abstracts in spanish indexed with icd10 codes, linguistic resources, gazetteers, and a background set of medical texts in spanish, will also be released to complement the codiesp corpus, together with annotation guidelines and details.

for the codiesp diagnosis and procedure coding subtasks, participants will submit their coding predictions as ranked results. for every document, a list of possible codes will be submitted, ordered by confidence or relevance. since these subtasks are designed to be ranking competitions, they will be evaluated on a standard ranking metric: mean average precision. for the codiesp explainable ai subtask, the explainability of the systems will be considered in addition to their performance on the test set. systems have to provide textual evidence from the clinical case documents that supports the code assignment and thus can be interpreted by humans. this automatically returned evidence will be evaluated against manually annotated text spans. true positive evidence texts are those that constitute a sub-match of the manual annotations. f1 will be used as the primary evaluation metric.

the 2020 clef ehealth task 2 on ir builds on the tasks that have run at clef ehealth since its inception in 2012. this consumer health search (chs) task follows a standard ir shared-challenge paradigm in that it provides participants with a test collection consisting of a set of documents and a set of topics for which to develop ir techniques. runs submitted by participants are pooled, and manual relevance assessments are conducted. performance measures are then returned to participants. in the 2017 clef ehealth chs task, similarly to 2016, we used the clueweb12 b13 document collection [12, 18], which consisted of 52.3 million medically related web pages. given the scale of this document collection, participants reported that it was difficult to store and manipulate.
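the coding subtasks above are scored with mean average precision over per-document ranked code lists, and the explainable ai subtask with f1 over the returned evidence spans. the minimal python sketch below illustrates these measures; the function names, the data layout, and the exact sub-match rule (implemented here as a plain substring check) are assumptions for illustration and are not the official codiesp evaluation scripts.

```python
from typing import Dict, List, Set, Tuple

def average_precision(ranked_codes: List[str], gold: Set[str]) -> float:
    """average precision of one document's ranked code list against its gold code set."""
    hits, score = 0, 0.0
    for rank, code in enumerate(ranked_codes, start=1):
        if code in gold:
            hits += 1
            score += hits / rank
    return score / len(gold) if gold else 0.0

def mean_average_precision(runs: Dict[str, List[str]], gold: Dict[str, Set[str]]) -> float:
    """map over documents; `runs` maps a document id to its ranked code predictions."""
    return sum(average_precision(codes, gold.get(doc, set()))
               for doc, codes in runs.items()) / len(runs)

def evidence_f1(pred: List[Tuple[str, str]], gold: List[Tuple[str, str]]) -> float:
    """f1 where a predicted (code, evidence_text) pair counts as a true positive if its
    text is a sub-match (here: substring) of a gold evidence span for the same code."""
    tp = sum(any(c == gc and e in ge for gc, ge in gold) for c, e in pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# toy example: one document with two gold diagnosis codes
print(mean_average_precision({"doc1": ["j18.9", "i10", "e11.9"]},
                             {"doc1": {"j18.9", "e11.9"}}))
```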
in response, the 2018 chs task introduced a new document collection, named clefehealth2018. this collection consists of over 5 million medical webpages from selected domains acquired from the commoncrawl [7]. given the positive feedback received for this document collection, it will be used again in the 2020 chs task. historically, the clef ehealth ir task has released text queries representative of laypeople's information needs in various scenarios. in recent years, query variations issued by multiple laypeople for the same information need have been offered. in this year's task we extend this to spoken queries. these spoken queries are generated by 6 individuals using the information needs derived for the 2018 challenge [7]. we also provide textual transcripts of these spoken queries and their asr conversions. given the query variants for an information need, participants are challenged in the 2020 task with retrieving the relevant documents from the provided document collection. this is divided into a number of subtasks, which can be completed using the spoken queries, their manual textual transcripts, or their asr outputs. similar to the 2018 chs task, the subtasks explored this year are ad-hoc/personalized search, query variations, and search intent, with binary preference, mean reciprocal rank, normalized discounted cumulative gain at ranks 1-10, and (understandability-biased) rank-biased precision as subtask-dependent evaluation measures. participants can submit multiple runs for each subtask.

the general purpose of clef ehealth throughout the years, as its 2020 ie and ir tasks demonstrate, has been to assist laypeople in finding and understanding health information in order to make informed decisions. breaking language barriers has been our priority over the years, and this will continue in our multilingual tasks. text has been our major medium of interest, but speech has been, and continues to be, included in tasks as a major new way of interacting with systems. each year of the labs has enabled the identification of difficulties and challenges in ie, im, and ir which have shaped our tasks. for example, popular ir tasks have considered multilingual, contextualized, and/or spoken queries and query variants. however, further exploration of query construction, aiming at a better understanding of chs, is still needed. in the future, the task will also further explore relevance dimensions and work toward a better assessment of readability and reliability, as well as methods to take these dimensions into consideration. as lab organizers, our purpose is to increase the impact and the value of the resources, methods, and community built by clef ehealth. examining the quality and stability of the lab contributions will help the clef ehealth series to better understand where it should be improved and how. as future work, we intend to continue our analyses of the influence of the clef ehealth evaluation series from the perspectives of publications and data/software releases [3, 14, 15].
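the chs subtasks above are scored with rank-based measures such as ndcg at early ranks, mean reciprocal rank, and rank-biased precision. a minimal sketch of the textbook forms of these measures follows; it is for illustration only, is not the official clef evaluation tooling, and omits the understandability-biased variant of rbp, which additionally weights documents by assessed understandability.

```python
import math
from typing import List

def ndcg_at_k(gains: List[float], k: int = 10) -> float:
    """ndcg@k for one query; `gains` are graded relevance values in ranked order."""
    def dcg(g: List[float]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(g[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def reciprocal_rank(gains: List[float]) -> float:
    """contribution of one query to mrr: 1/rank of the first relevant document."""
    for rank, rel in enumerate(gains, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def rbp(gains: List[float], p: float = 0.8) -> float:
    """rank-biased precision with persistence p, treating relevance as binary."""
    return (1 - p) * sum((1.0 if rel > 0 else 0.0) * p ** i for i, rel in enumerate(gains))

ranked_relevance = [2, 0, 1, 0, 0, 1]   # graded judgements of one run, top to bottom
print(ndcg_at_k(ranked_relevance), reciprocal_rank(ranked_relevance), rbp(ranked_relevance))
```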
references:
- aspiring to unintended consequences of natural language processing: a review of recent developments in clinical and consumer-generated text processing
- advancing the state of the art in clinical natural language processing through shared tasks
- an analysis of evaluation campaigns in ad-hoc medical information retrieval: clef ehealth
- overview of the clef ehealth evaluation lab
- clef 2017 ehealth evaluation lab overview
- community challenges in biomedical text mining over 10 years: success, failure and the future
- overview of the clef 2018 consumer health search task
- overview of the clef ehealth evaluation lab
- overview of the share/clef ehealth evaluation lab
- overview of the clef ehealth evaluation lab
- patient empowerment: the need to consider it as a measurable patient-reported outcome for chronic conditions
- working notes of conference and labs of the evaluation (clef) forum. ceur workshop proceedings
- clefehealth2012 - the clef 2012 workshop on cross-language evaluation of methods, applications, and resources for ehealth document analysis
- working notes
- scholarly influence of the conference and labs of the evaluation forum ehealth initiative: review and bibliometric study of the 2012 to 2017 outcomes
- information retrieval evaluation in a changing world: lessons learned from 20 years of clef
- overview of the clef ehealth evaluation lab
- overview of the share/clef ehealth evaluation lab
- the ir task at the clef ehealth evaluation lab 2016: user-centred health information retrieval

key: cord-020848-nypu4w9s authors: morris, david; müller-budack, eric; ewerth, ralph title: slideimages: a dataset for educational image classification date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_36 sha: doc_id: 20848 cord_uid: nypu4w9s

in the past few years, convolutional neural networks (cnns) have achieved impressive results in computer vision tasks, which, however, mainly focus on photos with natural scene content. in addition, non-sensor-derived images such as illustrations, data visualizations, and figures are typically used to convey complex information or to explore large datasets. however, this kind of image has received little attention in computer vision. cnns and similar techniques use large volumes of training data. currently, many document analysis systems are trained in part on scene images due to the lack of large datasets of educational image data. in this paper, we address this issue and present slideimages, a dataset for the task of classifying educational illustrations. slideimages contains training data collected from various sources, e.g., wikimedia commons and the ai2d dataset, and test data collected from educational slides. we have reserved all the actual educational images as a test dataset in order to ensure that approaches using this dataset generalize well to new educational images, and potentially other domains. furthermore, we present a baseline system using a standard deep neural architecture and discuss dealing with the challenge of limited training data.

convolutional neural networks (cnns) are making great strides in computer vision, driven by large datasets of annotated photos, such as imagenet [1]. many images relevant for information retrieval, such as charts, tables, and diagrams, are created with software rather than through photography or scanning. there are several applications in information retrieval for a robust classifier of educational illustrations.
search tools might directly expose filters by predicted label, natural language systems could choose images by type based on what information a user is seeking. further analysis systems could be used to extract more information from an image to be indexed based on its class. in this case, we have classes such as pie charts and x-y graphs that indicate what type of information is in the image (e.g., proportions, or the relationship of two numbers) and how it is symbolized (e.g., angular size, position along axes). most educational images are created with software and are qualitatively different from photos and scans. neural networks designed and trained to make sense of the noise and spatial relationships in photos are sometimes suboptimal for born-digital images and educational images in general. educational images and illustrations are under-served in training datasets and challenges. competitions such as the contest on robust reading for multi-type web images [2] and icdar detext [3] have shown that these tasks are difficult and unsolved. research on text extraction such as morris et al. [4] and nayef and ogier [5] has shown that even noiseless born-digital images are sometimes better analyzed with neural nets than with handcrafted features and heuristics. born-digital and educational images need further benchmarks on challenging information retrieval tasks in order to test generalization. in this paper, we introduce slideimages, a dataset which targets images from educational presentations. most of these educational illustrations are created with diverse software, so the same symbols are drawn in different ways in different parts of the image. as a result, we expect that effective synthetic datasets will be hard to create, and methods effective on slideimages will generalize well to other tasks with similar symbols. slideimages contains eight classes of image types (e.g. bar charts and x-y plots) and a class for photos. the labels we have created were made with information extraction for image summarization in mind. in the rest of this paper, we discuss related work in sect. 2, details about our dataset and baseline method in sect. 3, results of our baseline method in sect. 4, and conclude with a discussion of potential future developments in sect. 5. prior information retrieval publications used or could use document figure classification. charbonnier et al. [6] built a search engine with image type filters. aletras and mittal [7] automatically label topics in photos. kembhavi et al.'s [8] diagram analysis assumes the input figure is a diagram. hiippala and orekhova extended that dataset by annotating it in terms of relational structure theory, which implies that the same visual features communicate the same semantic relationships. de herrera et al. [9] seek to classify image types to filter their search for medical professionals. we intend to use document figure classification as a first step in automatic educational image summarization applications. a similar idea is followed by morash et al. [10] , who built one template for each type of image, then manually classified images and filled out the templates, and suggested automating the steps of that process. moraes et al. [11] mentioned the same idea for their sight (summarizing information graphics textually) system. a number of publications on document image classification such as afzal et al. [12] and harley et al. 
[13] use the rvl-cdip (ryerson vision lab complex document information processing) dataset, which covers scanned documents. while document scans and born-digital educational illustrations have materially different appearance, these papers show that the utility of deep neural networks is not limited to scene image tasks (fig. 1) . a classification dataset of scientific illustrations was created for the noa project [14] . however, their dataset is not publicly available, and does not draw as many distinctions between types of educational illustrations. jobin et al.'s docfigure [15] consists of 28 different categories of illustrations extracted from scientific publications totaling 33,000 images. techniques that work well on docfigure [15] do not generalize to the educational illustrations in our use case scenarios (as we also show in sect. 4.2). different intended uses or software cause sufficient differences in illustrations that a dataset of specifically educational illustrations is needed. cnns and related techniques are heavily data driven. an approach must consist of both an architecture and optimization technique, but also the data used for that optimization. in our case, we consider the dataset our main contribution. when building our taxonomy, we have chosen classes such that one class would have the same types of salient features, and appropriate summaries would also be similar in structure. our classes are also all common in educational materials. beyond the requirements of our taxonomy, our datasets needed to be representative of common educational illustrations in order to fit real-world applications, and legally shareable to promote research on educational image classification. educational illustrations are created by a variety of communities with varying expertise, techniques, and tools, so choosing a dataset from one source may eliminate certain variables in educational illustration. to identify these variables, we kept our training and test data sources separate. we assembled training and validation datasets from various sources of open access illustrations. bar charts, x-y plots, maps, photos, pie charts, slide images, table images, and technical drawings were manually selected by a student assistant (supported by the main author) using the wikimedia commons image search for related terms. we manually selected graph diagrams, which we also call node-edge diagrams or "structured diagrams," from the kembhavi et al. [8] allenai diagram understanding (ai2d) dataset; not all ai2d images contain graph edges [8] . the training dataset of slideimages consists of 2,938 images and is intended for fine-tuning cnns, not training from scratch. the slideimages test set is derived from a snapshot of slidewiki open educational resource platform (https://slidewiki.org/) datastore obtained in 2018. from that snapshot, two annotators manually selected and labeled 691 images. our data are available at our code repository: https://github.com/david-morris/slideimages/. the slideimages training dataset is small compared to datasets like imagenet [1] , with over 14 million images, rvl-cdip [13] with 400,000 images, or even docfigure [15] with 33,000 images. much of our methodology is shaped by needing to confront the challenges of a small dataset. in particular, we aim to avoid overfitting: the tendency of a classifier to identify individual images and patterns specific to the training set rather than the desired semantic concepts. 
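a minimal sketch of loading such a class-per-folder image dataset for fine-tuning is given below. the directory layout, folder names, class identifiers, and image size are hypothetical assumptions for illustration and are not taken from the slideimages repository linked above.

```python
import tensorflow as tf

# hypothetical on-disk layout: one sub-folder per class, e.g.
#   slideimages/train/bar_chart/0001.png, slideimages/train/photo/0002.jpg, ...
CLASSES = ["bar_chart", "graph_diagram", "map", "photo", "pie_chart",
           "slide", "table", "technical_drawing", "xy_plot"]

train_ds = tf.keras.utils.image_dataset_from_directory(
    "slideimages/train", class_names=CLASSES, label_mode="categorical",
    image_size=(224, 224), batch_size=32)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "slideimages/test", class_names=CLASSES, label_mode="categorical",
    image_size=(224, 224), batch_size=32, shuffle=False)

print(train_ds.class_names)   # sanity check: eight illustration classes plus photos
```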
for our pre-training dataset, a large, diverse dataset is required that contains a large proportion of educational and scholarly images. we pre-trained on a dataset of almost 60,000 images labeled by sohmen et al. [6] (the noa dataset), provided by the authors on request. the images are categorized as composite images, diagrams, medical imaging, photos, or visualizations/models. to mitigate overfitting, we used data augmentation: distorting an image while keeping relevant traits. we used image stretching, brightness scaling, zooming, and color channel shifting, as shown in our source code. we also added dropout with a rate of 0.1 on the extracted features before the fully connected and output layers. we used similar image augmentation for pre-training and training.

we use mobilenetv2 [16] as our network architecture. we chose mobilenetv2 as a compromise between a small number of parameters and performance on imagenet. intuitively, a smaller parameter space implies a model with more bias and lower variance, which is better for smaller datasets. we initialized our weights from an imagenet model and pre-trained for a further 40 epochs with early stopping on the noa dataset using the adam (adaptive moment estimation) [17] optimizer. this additional pre-training was intended to cause the lower levels of the network to extract more features specific to born-digital images. we then trained for 40 epochs with adam and a learning rate schedule. our schedule drops the learning rate by a factor of 10 at the 15th and 30th epochs. our implementation is available at https://github.com/david-morris/slideimages/.

we have performed two experiments in order to show that this dataset represents a meaningful improvement over existing work and to establish a baseline. because our classes are unbalanced, we report summary statistics as accuracy averaged over classes, weighted by the number of instances per class. we set a baseline for our dataset with the classifier described in sect. 3.2. the confusion matrix in fig. 2 shows that misclassifications do tend towards a few types of errors, but none of the classes have collapsed. while certain classes are likely to be misclassified as another specific class (such as structured diagrams as slides), those relationships do not happen in reverse, and a correct classification is more likely. figure 2 shows that our baseline leaves room for improvement, and our test set helps to identify challenges in this task. viewing individual classification errors highlighted a few problems with our training data.

the related docfigure dataset covers similar images and has much more data than slideimages. to justify slideimages, we have created a head-to-head comparison of classifiers trained in the same way (as described in sect. 3.2) on the slideimages and docfigure datasets. all the slideimages classes except slides have an equivalent in docfigure. we have shown the reduction in the data used, and the relative sizes of the datasets, in table 1. the head-to-head datasets contain only the matching classes, and in the case of the docfigure dataset, the original test set has been split into validation and test sets. after obtaining the two trained networks, we have tested each network on both the matching test set and the other test set.
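a minimal keras sketch of the training recipe described above (mobilenetv2 backbone initialised from imagenet, augmentation, dropout of 0.1 on the extracted features, adam, and a learning-rate drop by a factor of 10 at epochs 15 and 30) is shown below. it is not the authors' implementation, which is linked above; the directory layout, batch size, initial learning rate, augmentation ranges, and the omission of the noa pre-training stage are assumptions made for illustration.

```python
import tensorflow as tf

IMG = (224, 224)

# augmentation roughly following the description above: stretching/shifting,
# brightness scaling, zooming, and colour-channel shifting (the ranges are guesses)
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=tf.keras.applications.mobilenet_v2.preprocess_input,
    width_shift_range=0.1, height_shift_range=0.1,
    zoom_range=0.2, brightness_range=(0.7, 1.3), channel_shift_range=30.0,
)
train_gen = datagen.flow_from_directory("slideimages/train", target_size=IMG,
                                        batch_size=32, class_mode="categorical")

# mobilenetv2 backbone initialised from imagenet weights; the additional
# pre-training on the noa images is omitted here for brevity
backbone = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                             pooling="avg", input_shape=IMG + (3,))
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dropout(0.1),                      # dropout on the extracted features
    tf.keras.layers.Dense(train_gen.num_classes, activation="softmax"),
])

def schedule(epoch, lr):
    # drop the learning rate by a factor of 10 at the 15th and 30th epochs
    return lr * 0.1 if epoch in (15, 30) else lr

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_gen, epochs=40,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(schedule)])
```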
although we were unable to reproduce the vgg-v baseline used by jobin et al., we used a linear svm with vgg-16 features and achieved comparable results on the full docfigure dataset (90% macro average compared to their 88.96% with a fully neural feature extractor). the results (table 2) show that slideimages is a more challenging and potentially more general task. the net trained on slideimages did even better on the docfigure test set than on the slideimages test set. despite having a different source and being approximately a fifth of the size of the docfigure dataset, the net trained on the slideimages training set performed better on our test set.

in this paper, we have presented the task of classifying educational illustrations and images in slides and introduced a novel dataset, slideimages. the classification remains an open problem despite our baseline and represents a useful task for information retrieval. we have provided a test set derived from actual educational illustrations, and a training set compiled from open access images. finally, we have established a baseline system for the classification task. other potential avenues for future research include experimenting with the docfigure dataset in the pre-training and training phases, and experimenting with text extraction for multimodal classification.

references:
- imagenet: a large-scale hierarchical image database
- icpr2018 contest on robust reading for multi-type web images
- icdar2017 robust reading challenge on text extraction from biomedical literature figures (detext)
- a neural approach for text extraction from scholarly figures
- semantic text detection in born-digital images via fully convolutional networks
- noa: a search engine for reusable scientific images beyond the life sciences
- labeling topics with images using a neural network
- a diagram is worth a dozen images
- semi-supervised learning for image modality classification
- guiding novice web workers in making image descriptions using templates
- evaluating the accessibility of line graphs through textual summaries for visually impaired users
- cutting the error by half: investigation of very deep cnn and advanced training strategies for document image classification
- evaluation of deep convolutional nets for document image classification and retrieval
- figures in scientific open access publications
- docfigure: a dataset for scientific document figure classification
- 2018 ieee conference on computer vision and pattern recognition, cvpr 2018
- adam: a method for stochastic optimization

acknowledgement. this work is financially supported by the german federal ministry of education and research (bmbf) and european social fund (esf) (inclusiveocw project, no. 01pe17004).

key: cord-020912-tbq7okmj authors: batra, vishwash; haldar, aparajita; he, yulan; ferhatosmanoglu, hakan; vogiatzis, george; guha, tanaya title: variational recurrent sequence-to-sequence retrieval for stepwise illustration date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_4 sha: doc_id: 20912 cord_uid: tbq7okmj

we address and formalise the task of sequence-to-sequence (seq2seq) cross-modal retrieval. given a sequence of text passages as query, the goal is to retrieve a sequence of images that best describes and aligns with the query. this new task extends traditional cross-modal retrieval, where each image-text pair is treated independently, ignoring broader context. we propose a novel variational recurrent seq2seq (vrss) retrieval model for this seq2seq task.
unlike most cross-modal methods, we generate an image vector corresponding to the latent topic obtained from combining the text semantics and context. this synthetic image embedding point associated with every text embedding point can then be employed for either image generation or image retrieval as desired. we evaluate the model for the application of stepwise illustration of recipes, where a sequence of relevant images is retrieved to best match the steps described in the text. to this end, we build and release a new stepwise recipe dataset for research purposes, containing 10k recipes (sequences of image-text pairs) with a total of 67k image-text pairs. to our knowledge, it is the first publicly available dataset to offer rich semantic descriptions in a focused category such as food or recipes. our model is shown to outperform several competitive and relevant baselines in the experiments. we also provide a qualitative analysis of how semantically meaningful the results produced by our model are, through human evaluation and comparison with relevant existing methods.

there is growing interest in cross-modal analytics and search in multimodal data repositories. a fundamental problem is to associate images with some corresponding descriptive text. such associations often rely on semantic understanding, beyond traditional similarity search or image labelling, to provide human-like visual understanding of the text and reflect abstract ideas in the image.

(fig. 1: stepwise recipe illustration example showing a few text recipe instruction steps alongside one full sequence of recipe images. note that retrieval of an accurate illustration of step 4, for example, depends on previously acquired context information.)

cross-modal retrieval systems must return outputs of one modality from a data repository, while a different modality is used as the input query. the multimodal repository usually consists of paired objects from two modalities, but may be labelled or unlabelled. classical approaches to compare data across modalities include canonical correlation analysis [12], partial least squares regression [28], and their numerous variants. more recently, various deep learning models have been developed to learn shared embedding spaces from paired image-text data, either unsupervised, or supervised using image class labels. the deep models popularly used include deep belief networks [23], correspondence autoencoders [9], deep metric learning [13], and convolutional neural networks (cnns) [33]. with all these models it is expected that, by learning from pairwise aligned data, the common representation space will capture semantic similarities across modalities.

most such systems, however, do not consider sequences of related data in the query or result. in traditional image retrieval using text queries, for example, each image-text pair is considered in isolation, ignoring any broader 'context'. a context-aware image-from-text retrieval model must look at pairwise associations and also consider sequential relationships. such sequence-to-sequence (seq2seq) cross-modal retrieval is possible when contextual information and semantic meaning are both encoded and used to inform the retrieval step. for stepwise recipe illustration, an effective retrieval system must identify and align a set of relevant images corresponding to each step of a given text sequence of recipe instructions.
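to make the classical pairwise baselines mentioned above concrete, the sketch below uses scikit-learn's canonical correlation analysis to project paired text and image features into a shared space and retrieve images by cosine similarity. the feature dimensions, the random stand-in data, and the retrieval helper are purely illustrative assumptions; no claim is made that this matches the cited systems' implementations.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
# stand-ins for paired features of 500 text passages and their images
text_feats = rng.normal(size=(500, 300))    # e.g. averaged word embeddings
image_feats = rng.normal(size=(500, 512))   # e.g. cnn pool features

# project both views into a 64-d shared space learned from the paired data
cca = CCA(n_components=64, max_iter=1000)
cca.fit(text_feats, image_feats)
_, image_shared = cca.transform(text_feats, image_feats)

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def retrieve(query_text_vec, k=5):
    """indices of the k images whose shared-space vectors are closest to the query text."""
    q = cca.transform(query_text_vec.reshape(1, -1))   # transform the text view only
    return np.argsort(-cosine(q, image_shared)[0])[:k]

print(retrieve(text_feats[0]))   # the paired image should rank near the top
```

note that each image-text pair is still treated independently here, which is exactly the limitation that the seq2seq formulation discussed next addresses.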
more generally, for the task of automatic story picturing, a series of suitable images must be chosen to illustrate the events and abstract concepts found in a sequential text taken from a story. an example of the instruction steps and illustrations of a recipe taken from our new stepwise recipe dataset is shown in fig. 1 . in this paper, we present a variational recurrent learning model to enable seq2seq retrieval, called variational recurrent sequence-to-sequence (vrss) model. vrss produces a joint representation of the image-text repository, where the semantic associations are grounded in context by making use of the sequential nature of the data. stepwise query results are then obtained by searching this representation space. more concretely, we incorporate the global context information encoded in the entire text sequence (through the attention mechanism) into a variational autoencoder (vae) at each time step, which converts the input text into an image representation in the image embedding space. to capture the semantics of the images retrieved so far (in a story/recipe), we assume the prior of the distribution of the topic given the text input follows the distribution conditional on the latent topic from the previous time step. by doing so, our model can naturally capture sequential semantic structure. our main contributions can be summarised below: -we formalise the task of sequence-to-sequence (seq2seq) retrieval for stepwise illustration of text. -we propose a new variational recurrent seq2seq (vrss) retrieval model for seq2seq retrieval, which employs temporally-dependent latent variables to capture the sequential semantic structure of text-image sequences. -we release a new stepwise recipe dataset (10k recipes, 67k total imagetext pairs) for research purposes, and show that vrss outperforms several cross-modal retrieval alternatives on this dataset, using various performance metrics. our work is related to: cross-modal retrieval, story picturing, variational recurrent neural networks, and cooking recipe datasets. a number of pairwise-based methods over the years have attempted to address the cross-modal retrieval problem in different ways, such as metric learning [26] and deep neural networks [32] . for instance, an alignment model [16] was devised that learns inter-modal correspondences using ms-coco [19] and flickr-30k [25] datasets. other work [18] proposed unifying joint image-text embedding models with multimodal neural language models, using an encoder-decoder pipeline. a later method [8] used hard negatives to improve their ranking loss function, which yielded significant gains in retrieval performance. such systems focus only on isolated image retrieval when given a text query, and do not address the seq2seq retrieval problem that we study here. in a slight variation [2] , the goal was to retrieve an image-text multimodal unit when given a text query. for this, they proposed a gated neural architecture to create an embedding space from the query texts and query images along with the multimodal units that form the retrieval results set, and then performed semantic matching in this space. the training minimized structured hinge loss, and there was no sequential nature to the data used. picturing. an early story picturing system [15] retrieved landscape and art images to illustrate ten short stories based on key terms in the stories and image descriptions as well as a similarity linking of images. 
the idea was pursued further with a system [11] for helping people with limited literacy to read, which split a sentence into three categories and then retrieved a set of explanatory pictorial icons for each category. to our knowledge, an application [17] that ranks and retrieves image sequences based on longer text paragraphs as queries was the first to extend the pairwise image-text relationship to matching image sequences with longer paragraphs. they employed a structural ranking support vector machine with latent variables and used a custom-built disneyland dataset, consisting of blog posts with associated images as the parallel corpus from which to learn joint embeddings. we follow a similar approach, creating our parallel corpus from sequential stepwise cooking recipes rather than unstructured blog posts, and design an entirely new seq2seq model to learn our embeddings. the visual storytelling dataset (vist) [14] was built with a motivation similar to our own, but for generating text descriptions of image sequences rather than the other way around. relying on human annotators to generate captions, vist contains sequential image-text pairs with a focus on abstract visual concepts, temporal event relations, and storytelling. in our work, we produce a similar sequenced dataset in a simple, automated manner. a recent joint sequence-to-sequence model [20] learned a common image-text semantic space and generated paragraphs to describe photo streams. this bidirectional attention recurrent neural network was evaluated on both the above datasets. despite being unsuitable for our inverse problem, vist has also been used for retrieving images when given text, in work related to ours. in an approach called coherent neural story illustration (cnsi), an encoder-decoder network [27] was built to first encode sentences using a hierarchical two-level sentence-story gated recurrent unit (gru), and then sequentially decode into a corresponding sequence of illustrative images. a previously proposed coherence model [24] was used to explicitly model co-references between sentences. variational recurrent neural networks. our model is partly inspired by the variational recurrent neural network (vrnn) [6] , which introduces latent random variables into the hidden state of an rnn by combining it with a variational autoencoder (vae). they showed that using high level latent random variables, vrnn can model the variability observed in structured sequential data such as natural speech and handwriting. vrnn has recently been applied to other sequential modelling tasks such as machine translation [31] . our proposed vrss model introduces temporally-dependent latent variables to capture the sequential semantic structure of text/image sequences. different from existing approaches, we take into account the global context information encoded in the entire query sequence. we use vae for cross-modal generation by converting the text into a representation in the image embedding space instead of using it to reconstruct the text input. finally, we use the max-margin hinge loss to enforce similarity between text and paired image representations. cooking recipe datasets. the first attempt at automatic classification of food images was the food-101 dataset [3] having 101k images across 101 categories. since then, the new recipe1m dataset [29] gained wide attention, which paired each recipe with several images to build a collection of 13m food images for 1m recipes. 
recent work [4] proposed a cross-modal retrieval model that aligns recipe1m images and recipes in a shared representation space. as this dataset does not offer any sequential data for stepwise illustration, the association is between images of the final dish and the corresponding entire recipe text. our stepwise recipe dataset, by comparison, provides an image for each instruction step, resulting in a sequence of image-text pairs for each recipe. in [5], a dataset of sequenced image-text pairs in the cooking domain was released, with a focus on text generation conditioned on images. recipeqa [34] is another popular dataset, used for multimodal comprehension and reasoning, with 36k questions about 20k recipes and illustrative images for each step of the recipes. recent work [1] used it to analyse image-text coherence relations, thereby producing a human-annotated corpus with coherence labels to characterise different inferential relationships. the recipeqa dataset reveals associations between image-text pairs much like our stepwise recipe dataset, and we therefore utilise it to augment our own dataset.

we construct the stepwise recipe dataset, composed of illustrated, step-by-step recipes from three websites. recipes were automatically web-scraped and cleaned of html tags. the information about data and scripts will be made available on github. the construction of such an image-text parallel corpus has several challenges, as highlighted in previous work [17]. the text is often unstructured, without information about the canonical association between image-text pairs. each image is semantically associated with some portion of the text in the same recipe, and we assume that the images chosen by the author to augment the text are semantically meaningful. we thus perform text segmentation to divide the recipe text and associate each segment with a single image. we perform text-based filtering [30] to ensure text quality: (1) descriptions should have a high unique word ratio covering various part-of-speech tags, therefore descriptions with a high noun ratio are discarded; (2) descriptions with high repetition of tokens are discarded; and (3) some predefined boilerplate prefix/suffix sequences are removed. our constructed dataset consists of about 2k recipes with 44k associated images. furthermore, we augment our parallel corpus using similarly filtered recipeqa data [34], which contains images for each step of the recipes in addition to visual question answering data. the final dataset contains over 10k recipes in total and 67k images.

the seq2seq retrieval task is formalised as follows: given a sequence of text passages x = {x_1, x_2, ..., x_T}, retrieve a sequence of images i = {i_1, i_2, ..., i_T} (from a data repository) which best describes the semantic meanings of the text passages, i.e., which maximises p(i|x) = ∏_{t=1}^{T} p(i_t | i_{1:t−1}, x). we address the seq2seq retrieval problem by considering three aspects: (1) encoding the contextual information of text passages; (2) capturing the semantics of the images retrieved (in a story/recipe); and (3) learning the relatedness between each text passage and its corresponding image. it is natural to use rnns to encode a sequence of text passages. here, we encode a text sequence using a bi-gru. given a text passage, we use an attention mechanism to capture the contextual information of the whole recipe. we map the text embedding into a latent topic z_t by using a vae.
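the following pytorch sketch illustrates the encoder side just described: a bi-gru per passage, a top-level bi-gru over the passage representations, attention to form the context vector c_t, and a vae-style mapping of (h_t, c_t) to a latent topic z_t. it is a simplified illustration under stated assumptions (the dimensions, layer sizes, and dot-product scoring function are not taken from the paper), not the authors' implementation, and the temporal conditioning on z_{t−1} described in the next paragraph is omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PassageEncoder(nn.Module):
    """bi-gru over word embeddings; the two final hidden states are concatenated into h_t."""
    def __init__(self, vocab_size=10000, emb_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, tokens):                      # tokens: (num_passages, num_words)
        _, h = self.gru(self.emb(tokens))           # h: (2, num_passages, hidden)
        return torch.cat([h[0], h[1]], dim=-1)      # (num_passages, 2 * hidden)

def attention_context(h_t, h_context):
    """c_t = sum_l alpha_l * h^c_l; dot-product scoring is an assumption here,
    since the exact similarity function is not spelled out in the text above."""
    scores = h_context @ h_t.unsqueeze(-1)          # (batch, L, 1)
    alpha = F.softmax(scores, dim=1)
    return (alpha * h_context).sum(dim=1)           # (batch, 2 * hidden)

class TopicVAE(nn.Module):
    """maps the local state h_t and document context c_t to a latent topic z_t."""
    def __init__(self, h_dim=512, z_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(2 * h_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * z_dim))   # -> mean, log-variance

    def forward(self, h_t, c_t):
        mu, logvar = self.enc(torch.cat([h_t, c_t], dim=-1)).chunk(2, dim=-1)
        z_t = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation
        return z_t, mu, logvar

# usage sketch on one recipe with 7 instruction steps of 20 tokens each
enc, vae = PassageEncoder(), TopicVAE()
top_gru = nn.GRU(512, 256, batch_first=True, bidirectional=True)
passages = torch.randint(0, 10000, (7, 20))
h = enc(passages)                                   # (7, 512) local representations h_t
h_context, _ = top_gru(h.unsqueeze(0))              # (1, 7, 512) document-level states h^c_l
c_3 = attention_context(h[3].unsqueeze(0), h_context)
z_3, mu, logvar = vae(h[3].unsqueeze(0), c_3)       # latent topic for step t = 3
```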
in order to capture the semantics of the images retrieved so far (in a story/recipe), we assume that the prior distribution of the topic given the text input is conditional on the latent topic z_{t−1} from the previous time step. we decode the corresponding image vector i_t conditional on the latent topic using a multi-layer perceptron, to learn the relatedness between text and image, and thus obtain a synthetic image embedding point generated from its associated text embedding point. our proposed variational recurrent seq2seq (vrss) model is illustrated in fig. 2. below, we describe each of the main components of the vrss model.

we use a bi-gru to learn the hidden representations of the text passage (e.g. one recipe instruction) in the forward and backward directions. the two learned hidden states are then concatenated to form the text segment representation. to encode a sequence of such text passages (e.g. one recipe), a hierarchical bi-gru is used which first encodes each text segment and subsequently combines them.

image encoder. to generate the vector representation of an image, we use the pre-trained modified resnet50 cnn [22]. in experiments, this model produced a well-distributed feature space when trained on the limited domain, namely food-related images. this was verified using t-sne visualisations [21], which showed less clustering in the generated embedding space as compared to embeddings obtained from models pre-trained on imagenet [7].

to capture global context, we feed the bi-gru encodings into a top-level bi-gru. assuming the hidden state output of each text passage x_l in the global context is h^c_l, we use an attention mechanism to compute a weight α_l from the similarity between h^c_l and the hidden state output h_t of the t-th text passage. the context vector is then encoded as the combination of the L text passages weighted by the attentions, c_t = ∑_{l=1}^{L} α_l h^c_l. this ensures that any given text passage is influenced more by others that are semantically similar.

at the t-th step, the bi-gru output h_t for text x_t is combined with the context c_t and fed into a vae to generate the latent topic z_t. two prior networks, f^μ_θ and f^σ_θ, define the prior distribution of z_t conditional on the previous z_{t−1}, i.e., p(z_t | z_{t−1}) = N(f^μ_θ(z_{t−1}), f^σ_θ(z_{t−1})). we also define two inference networks, f^μ_φ and f^σ_φ, which are functions of h_t, c_t, and z_{t−1}, giving the approximate posterior q(z_t | x_{≤t}, z_{t−1}) = N(f^μ_φ(h_t, c_t, z_{t−1}), f^σ_φ(h_t, c_t, z_{t−1})). unlike the typical vae setup, where the text input x_t would be reconstructed by the generation networks, here we generate the corresponding image vector i_t. to generate the image vector conditional on z_t, the generation networks are defined to be also conditional on z_{t−1}. the generation loss for image i t is then: − kl(q(z t |x ≤t , z