Ibérica 34 (2017): 283-292
ISSN: 1139-7241 / e-ISSN: 2340-2784

Keeping up with the digital age: New data sources in research on languages for specific purposes

Amanda Roig-Marín
University of Cambridge
adr41@cam.ac.uk

Abstract

Social media exchanges (for example, via Facebook or Twitter), blogs, and forums, amongst many other electronic genres, have come to be used as relatively bona fide testimonies of present-day language use. The present paper discusses how such language data can be effectively employed in research into languages for specific purposes. In particular, it describes a project which aims to compile a representative number of texts (in this case, blog posts) from the Spanish fashion panorama in order to explore the lexis occurring in this specialised language. It thus details the procedure and criteria followed to create the database, and it tackles some of the challenges which the reader/researcher may encounter in this enterprise.

Keywords: corpora, Internet-based research, blogs, vocabulary in languages for specific purposes.

Resumen

Siguiendo el ritmo de la era digital: nuevas fuentes de datos en la investigación de lenguas para fines específicos

Los intercambios en redes sociales (por ejemplo, a través de Facebook o Twitter), blogs y foros, entre otros muchos géneros digitales, en la actualidad han pasado a ser utilizados como testimonios relativamente fidedignos de la lengua. El presente estudio trata cómo tales fuentes pueden ser utilizadas de forma efectiva en la investigación sobre lenguas para fines específicos. En particular, describe un proyecto que pretende compilar un número representativo de textos (en este caso, posts de blog) en el panorama de la moda española para explorar el léxico que se da en esta lengua de especialidad. Detalla, pues, el procedimiento y criterios seguidos para crear la base de datos y aborda algunos de los retos que el lector/investigador puede encontrar en esta tarea.

Palabras clave: corpus, investigación basada en internet, blogs, vocabulario en lenguas para fines específicos.

Introduction

Curious as it may seem to digital natives, and even to digital immigrants, the use of online sources in academia still causes a certain degree of suspicion and bewilderment among non-specialists. Some recent headlines suggest as much by foregrounding the type of source used: “Using Twitter, linguists find global ‘superdialects’” (note 1), “Cambridge University linguists use Twitter to study how Welsh language use is changing” (note 2), or “Status update language used to predict Facebook users’ age, gender, personality” (note 3). Precisely because social media and the Internet in general play a crucial role in our world, they tend to record everyday usage just as any other medium does. This explains why a constantly increasing number of works have concentrated on the potential use of these tools for research (see, amongst others, Zappavigna, 2012; Page et al., 2014; Wang & Winstead, 2016). Since each form of Internet communication is distinctive in its level of orality, its balance of visual and verbal material, and the (a)synchronicity of the medium, this diversity enables researchers to select the medium with the greatest potential to answer the questions they set out to explore. In this case, I was interested in examining blog posts, which epitomise asynchronous online communication of a considerable length and with certain (albeit intuitive) textual conventions.
Blogs are usually written by non-experts in the field and infused with a much more personal tone than other digital genres. As Zhang (2008: 37) suggests, “while the Internet has often been thought to decenter the role of the author (Bolter, 1991), blogs have often made the author the center of attention”. Still, it is not uncommon to find businesses advertising their services or products via blogs, which nonetheless use a narrative style that is considerably less detached from the customer than advertisements simply featuring goods. In this project, I did not intend to analyse the narratives of these two subtypes of blogs separately; rather, I focused on the lexical features of both personal and business blogs. My aim was to compile a dataset large enough to trace patterns, signs of language innovation, and oscillations in the use of native Spanish words vis-à-vis foreign words (namely, words of French and English origin), so that at later stages I would be able to compare the lexis occurring in blogs with that used in magazines (online and in print). My long-term aim is to examine to what extent bloggers are imbued with the discourse present in magazines and repeat the same lexical chunks (namely, collocations) and conventions found in these more established genres, or whether they diverge from them in order to construct their virtual identities or internet personae. This will hopefully lead to a deconstruction of this language for specific purposes as realised in these genres. Hence, even though the goal of this study is at present descriptive, once completed its findings may in the long run serve as a theoretical basis for the design of pedagogical material and activities (for example, to assist in the teaching of vocabulary and genre patterns specific to this area of LSP).
In the next section I will discuss why blogs matter for our current understanding of languages for specific purposes and what they can contribute to the discipline in comparison with well-established corpora. Subsequently, I will explain the type of dataset I compiled and the procedure and criteria I followed in compiling it, so that researchers working not only in lexicology but also in other fields of linguistics for specific purposes may draw inspiration from it and incorporate this vast 21st-century source of data into their practices.

Blogs and 21st-century corpus-driven tools

Whereas the potential of blogs for teaching a foreign language has been extensively explored (inter alia, Zhang, 2008; Thomas, 2009; Wang & Winstead, 2016), less has been written on their use in linguistic (more precisely, lexicological) research. However, their usefulness should not be overlooked, for several reasons. First, blogs contain highly structured information, being typically composed of a homepage (featuring the most recent posts) and an archive. The procedure for compiling a blog therefore seems, a priori, rather straightforward. Second, although the actual number of posts within each blog varies, the language samples they provide are considerably longer than those of other virtual genres. It is often argued that a blog post should contain at least 300 words to rank well in search engines, although in the database I compiled posts ranged from c. 200 to 700 words. Thus, this virtual space allows writers to devote as much time and space as they need, with no constraints. Third, the array of non-specialist and specialist voices which can be electronically heard through blog posts creates a composite picture which can be used to characterise the speech of these virtual communities. Fourth, as Hundt et al. (2007: 3) point out, “for the study of certain phenomena, in particular neologisms, the web is and probably will be one of our best sources of information”. Like other web-based means of communication, blogs are indeed mines of language data which cover a wide range of topics and are constantly updated. Consequently, lexical items, which are incessantly evolving and being coined, are more likely to make their way into dynamic depositories or databases than into static corpora or dictionaries.

Traditional corpora are finite bodies of text which provide a representative sample of a language from either a diachronic or a synchronic viewpoint. For general Spanish, the most emblematic corpora are CORDE (Corpus Diacrónico del Español), a diachronic corpus covering the earliest records written in Spanish up until the year 1974; CREA (Corpus de Referencia del Español Actual), comprising texts from 1974 to 2004; and the most recent one, CORPES XXI (Corpus del Español del Siglo XXI), which initially incorporated oral and written texts produced from 2001 to 2012. The CORPES XXI project is still under way and will culminate in 2018. General corpora are expected to include both “standard” and “non-standard” uses of the language so that they present opportunities to study the language in all its forms. However, the asymmetry between the number of “standard” and “non-standard” texts included is problematic when tracing and predicting the development of new “non-standard” uses. For example, in English even such a large corpus as COCA (Corpus of Contemporary American English) yielded only eight tokens of “past tense spreading” with swim (i.e. *swimmed), whereas over 88,000 instances were retrieved from the Web (cf. Geeraert & Newman, 2011: 2). This difference is particularly significant for the language researcher, whose task is to document any existing and emergent uses occurring in the language.
When it comes to languages for specific purposes, particularly in their didactic facet, ad hoc corpora are also developed to examine and teach/learn the features of particular genres and language contexts which are not well represented in general corpora (on this topic, see, for instance, Bárcena, Read, & Arús, 2014). Specialised corpora vary greatly in size and level of specificity, and they may concentrate on very specific text types such as grant proposals (Connor & Upton, 2004) or on the language for specific purposes as a whole (for example, the Cambridge Business English Corpus). The dataset that I will broadly characterise in the next section resembles specialised corpora, since the focus is exclusively on fashion blogs written in Spanish, although its relatively limited size (around 900,000 words) compared with well-established corpora prevents me from calling it one. English blog-based corpora include the Birmingham Blog Corpus, consisting of 628,558,282 words extracted from blogs, and the Blog Authorship Corpus, which encompasses the posts of 19,320 bloggers collected in August 2004, totalling over 140 million words. Nonetheless, these are general English corpora, and their texts are not categorised into any semantic fields “but split into sections according to how the texts were discovered and downloaded”, as the Birmingham Blog Corpus website explains. In this sense, the dataset I compiled has a much narrower scope, which may facilitate the task of the researcher interested in describing the contemporary Spanish language of fashion and how fashionistas are portrayed in blogs.
Previous research has already concentrated on a limited number of fashion blogs, but mostly from angles quite different from the one adopted in this paper. For instance, Ruiz Molina (2012) analysed the impact of blogs on consumer-centred companies and fashion journalism (also the main focus of Rocamora, 2012) by adopting a semiotic framework; Riera and Figueras Maz (2012) examined whether fashion blogs attempt to perpetuate the same idealised beauty standard as the one promoted in most fashion magazines, which did not prove to hold equally for all sub-categories of blogs (e.g. “ego-blogs”); along similar lines, Rocamora (2011) considered the process of identity construction and the representation of femininity in this virtual space; and, more recently, Martínez Navarro and de Garcillán López-Rúa (2016) explored, by conducting interviews and group meetings with informants, how the emergence and popularisation of this electronic medium has reshaped the practices of consumers. It is therefore clear that the present piece of research is not the first on fashion blogs, although its more linguistically oriented aim does offer some scope for originality. Similarly, even though the use of corpus tools in the compilation and tagging of texts has been widely discussed in LSP research (see, amongst others, Flowerdew, 2005; Millar & Budgell, 2008; Carrió-Pastor & Muñiz-Calderón, 2013; Herrero-Zazo, Segura-Bedmar, & Martínez, 2013) and, to a lesser extent, in research on thematically diverse e-genres such as blogs (e.g. Wallsten, 2005; Lukač, 2011; Ptaszynski, 2012), substantially less has been written on their interface, that is, on corpus-based approaches to fashion blogs, a gap which the present investigation attempts to fill.

A case study: The fashion blog database

The dataset I assembled consists of the posts of 100 bloggers gathered from 2013 to 2016, amounting to 2,927 blog posts (~900,000 words). Many blogs were hosted on the websites of magazines which have sections devoted to fashion, although others had independent domains and were found by simple searches containing the key words blogs, moda, and España. The selection of textual material was guided by four main criteria: (1) language, (2) length, (3) date of production, and (4) topic.

A criterion essential to this project was that all blog posts had to be written in Spanish. Bloggers frequently wrote their posts in both English and Spanish, but this practice would not give real insights into the state of Spanish (for instance, as far as the role of foreign lexis in Spanish is concerned). Accordingly, for the sake of consistency, every post in the dataset had to satisfy this language criterion. Likewise, posts could not exceed 700 words, nor could they be shorter than 200 words. In that way, despite variations, I made sure that all blog posts contained a minimum amount of text and were not simply made up of photos. As for the date of production, I concentrated on the period from January 2013 to January 2016. This was a short time span, as I intended to pinpoint the most recent lexical trends; however, it could be altered to suit the researcher’s needs. The same applies to the topic(s) discussed in blog posts. I only considered those texts that fell within the sphere of fashion, excluding posts which, despite being part of a fashion blog, touched upon completely unrelated areas.
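
For readers who wish to automate part of this selection, the four criteria can be sketched in a few lines of Python. This is only a minimal illustration and not necessarily the procedure followed in the project: the Post record, its field names, the whitespace-based word count, the reading of the date range as inclusive of both months, and the exclusion of any post not labelled "es" are all assumptions. Language and topic are treated as labels that have already been assigned, whether by hand or with a tool of the researcher's choice.

```python
from dataclasses import dataclass
from datetime import date

# Length thresholds taken from the criteria described in the text.
MIN_WORDS = 200
MAX_WORDS = 700
# Assumption: "January 2013 to January 2016" is read as inclusive of both months.
START = date(2013, 1, 1)
END = date(2016, 1, 31)


@dataclass
class Post:
    """Hypothetical record for one blog post; field names are illustrative."""
    blog: str
    published: date
    text: str
    language: str   # e.g. "es"; assumed to have been annotated beforehand
    on_topic: bool  # assumed manual judgement: is the post about fashion?


def word_count(text: str) -> int:
    """Rough word count: whitespace-separated tokens (an assumption)."""
    return len(text.split())


def meets_criteria(post: Post) -> bool:
    """Apply the four criteria: language, length, date, and topic."""
    return (
        post.language == "es"
        and MIN_WORDS <= word_count(post.text) <= MAX_WORDS
        and START <= post.published <= END
        and post.on_topic
    )


if __name__ == "__main__":
    # Invented placeholder post of 250 words, for illustration only.
    sample = Post("Persiguiendo a Mar", date(2015, 12, 5), "moda " * 250, "es", True)
    print(meets_criteria(sample))  # True
```

Keeping the thresholds as named constants makes it easy to adapt the time span or the length limits to a different project, as suggested above.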
This thematic decision depends entirely on researchers, their precise aims, and the specialised domains in which they are working (tourism, economics, natural sciences, or medicine, to name but a few).

Once I had retrieved the posts that would be part of my dataset, I created a labelling system in order to identify each file (since each post had an independent file). Each label included an initialism and a number, which referred to (1) the name of the blog from which the post was retrieved (e.g. PAM for “Persiguiendo a Mar” or MDE for “Maquillaje de estrellas”) and (2) the date on which the post was released. If more than one post was produced on the same day, I numbered them (e.g. 05-12-2015(1)). For the purposes of this project, I removed all images and kept only written text so that the files could be easily processed electronically. Furthermore, I recorded basic information about each blog: (1) its name, (2) its sub-topic (accessories or clothing, amongst others), (3) its type (personal blog, commercial blog, or others), and (4) the link from which it had been obtained. Depending on the type of study conducted, the researcher can also benefit greatly from knowing the main intended readers, the author’s background, and other (con)textual elements.

Nowadays electronic tools make it possible to handle large datasets such as this one efficiently. There are tools (e.g. WordSmith or Sketch Engine) which work with any corpus (regardless of its size, type, and the language(s) in which it is written), and they provide basic information on the behaviour of headwords, such as KWIC (Key Word in Context) concordances and collocations (understood here within Cowie’s (1981) and Howarth’s (1996) theoretical framework). Therefore, the use of blogs as primary material should not pose any problems in this regard.
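
Two of the steps described in this section, the file-labelling scheme and a keyword-in-context lookup, can be illustrated with a short Python sketch. It is a simplified stand-in and not how WordSmith or Sketch Engine work internally. The underscore separator in the label, the day-month-year reading of the date, the function names, and the regular-expression tokenisation are assumptions made for the example, and the sample sentence is invented.

```python
import re
from datetime import date
from typing import List, Optional, Tuple


def make_label(initialism: str, published: date, index: Optional[int] = None) -> str:
    """Build a file label such as 'PAM_05-12-2015(1)'.

    Assumptions: day-month-year order, an underscore separator, and a
    bracketed index that is added only when one is supplied (for posts
    sharing a date).
    """
    label = f"{initialism}_{published.strftime('%d-%m-%Y')}"
    if index is not None:
        label += f"({index})"
    return label


def kwic(text: str, keyword: str, window: int = 4) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every hit.

    Tokens are lower-cased runs of word characters, so matching is
    case-insensitive and punctuation is ignored.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    print(make_label("PAM", date(2015, 12, 5), 1))  # PAM_05-12-2015(1)
    # Invented sentence, used only to show the shape of the output.
    for left, word, right in kwic("El look de hoy combina un blazer negro con jeans", "blazer", window=2):
        print(f"{left:>20} [{word}] {right}")
```

Searching for a borrowed item such as blazer in this way gives a quick view of the Spanish words that surround it, which is the kind of evidence needed to study foreign lexis and collocations in the dataset.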
It is likewise worth mentioning that, despite the great assets of the online medium, apparently minor issues such as misspellings, orthographic variations, or foreign words that have, for example, been unconventionally adapted to the Spanish graphemic system could hinder vocabulary research to a certain extent. This pitfall deserves further attention because not all misspellings can be predicted. Still, recurrent spelling or grammatical errors may hint at wider patterns that would require further study.

Concluding remarks

This short article has aimed to provide researchers with basic guidelines on how to approach the compilation of online textual material. In particular, I have concentrated on blogs and described the procedure used to compile a real-life linguistic dataset, which may be replicated in almost any other field. Two of the greatest advantages of the Internet are that online users produce language fairly naturally, without being particularly aware that they might be observed for scholarly research (thereby avoiding the observer’s paradox), and that information is constantly renewed and produced, unlike in paper-based lexicographical sources. One may argue that blogs will be completely superseded by vlogs or other genres, but whilst this shift may be noticeable in some Internet communities, blogs continue to be very much used in such fields as fashion or tourism. Indeed, research into specialised languages requires restricted subsets of language of precisely the type that blogs present. For instance, if one is interested in examining in-group markers used in the language of fashion, it is very likely that fashion magazines will not attest them in the same way as fashionistas or fashion enthusiasts who write their own posts with a set of conventions and an (often interactive) audience in mind.
By compiling a corpus of texts online, it is possible not only to capture myriad snapshots of the language at a given time but also to keep adding new texts which reflect the changing nature of language itself. All of this will hopefully inspire a much more flexible view of corpus-driven studies of languages for specific purposes. Traditional corpora have been in use since at least 1980 (Heuberger, 2016: 24), so it is time “to keep up with the digital age” and take full advantage of the spectrum of resources that the Internet offers.

Article history:
Received 05 September 2016
Received in revised form 13 December 2016
Accepted 15 December 2016

References

Bárcena, E., T. Read & J. Arús (2014). Languages for Specific Purposes in the Digital Era. New York: Springer.
Bolter, J.D. (1991). Writing Space: The Computer, Hypertext, and the History of Writing. Hillsdale, NJ: Lawrence Erlbaum Associates.
Carrió-Pastor, M.L. & R. Muñiz-Calderón (2013). “The compilation of a corpus of business English: Syntactic variation”. Procedia - Social and Behavioral Sciences 95: 89-95.
Connor, U. & T.A. Upton (2004). “The genre of grant proposals: A corpus linguistic analysis” in U. Connor & T. Upton (eds.), Discourse in the Professions: Perspectives from Corpus Linguistics, 235-256. Amsterdam/Philadelphia: John Benjamins.
Cowie, A.P. (1981). “The treatment of collocations and idioms in learners’ dictionaries” in P. Strevens (ed.), In honour of A.S. Hornby, 223-235. Oxford: Oxford University Press.
Flowerdew, L. (2005). “An integration of corpus-based and genre-based approaches to text analysis in EAP/ESP: Countering criticisms against corpus-based methodologies”. English for Specific Purposes 24,3: 321-332.
Geeraert, K. & J. Newman (2011). “I haven’t drank in weeks: The use of past tense forms as past participles in English corpora” in J. Newman, H. Baayen & S. Rice (eds.), Corpus-Based Studies in Language Use, Language Learning, and Language Documentation, 13-34. Amsterdam/New York: Rodopi.
Herrero-Zazo, M., I. Segura-Bedmar & P. Martínez (2013). “Annotation issues in pharmacological texts”. Procedia - Social and Behavioral Sciences 95: 211-219.
Heuberger, R. (2016). “Corpora as game changers: The growing impact of corpus tools for dictionary makers and users”. English Today 32: 24-30.
Howarth, P.A. (1996). Phraseology in English Academic Writing. Tübingen: Max Niemeyer.
Hundt, M., N. Nesselhauf & C. Biewer (2007). Corpus Linguistics and the Web. Amsterdam/New York: Rodopi.
Lukač, M. (2011). “Down to the bone: A corpus-based critical discourse analysis of pro-eating disorder blogs”. Jezikoslovlje 12,2: 187-209.
Martínez Navarro, G. & M. de Garcillán López-Rúa (2016). “The influence of blogs in fashion in consumer behavior: An exploratory approach”. Vivat Academia 135: 85-109.
Millar, N. & B. Budgell (2008). “The language of public health - a corpus-based analysis”. The Journal of Public Health 16,5: 369-374.
Page, R., D. Barton, J.W. Unger & M. Zappavigna (2014). Researching Language and Social Media: A Student Guide. New York: Routledge.
Ptaszynski, M. et al. (2012). “Annotating syntactic information on 5.5 billion word corpus of Japanese blogs” in Proceedings of The 18th Annual Meeting of The Association for Natural Language Processing (NLP-2012), 385-388. New York: Curran Associates.
Riera, S. & M. Figueras Maz (2012). “El modelo de belleza de la mujer en los blogs de moda. ¿Una alternativa a la prensa femenina tradicional?”. Cuestiones de género: de la igualdad y la diferencia 7: 157-176.
Rocamora, A. (2011). “Personal fashion blogs: Screens and mirrors in digital self-portraits”. Fashion Theory 15,4: 407-424.
Rocamora, A. (2012). “Hypertextuality and remediation in the fashion media: The case of fashion blogs”. Journalism Practice 6,1: 92-106.
Ruiz Molina, E. (2012). Blogs de moda: un análisis semiótico. Barcelona: Design, Knowledge & Future.
Thomas, M. (2009). Handbook of Research on Web 2.0 and Second Language Learning. Hershey/New York: IGI Global.
Wallsten, K. (2005). “Political blogs and the bloggers who blog them: Is the political blogosphere an echo chamber?”. In American Political Science Association’s Annual Meeting. Washington, DC, September, 1-36. Online. URL:
Wang, C. & L. Winstead (2016). Handbook of Research on Foreign Language Education in the Digital Age. Hershey/New York: IGI Global.
Zappavigna, M. (2012). Discourse of Twitter and Social Media. London: Bloomsbury.
Zhang, F. (2008). Handbook of Research on Computer-Enhanced Language Acquisition and Learning. Hershey/New York: IGI Global.

Amanda Roig-Marín holds a BA in English and an MA in Spanish and English as SLs/FLs (both with “Distinction”), and she is currently doing an MPhil in Linguistics at the University of Cambridge. Her main research interests include English lexicology, Spanish-English lexical influences, and historical linguistics. She has published on these areas and delivered papers at national and international conferences.

NOTES

1 Headline retrieved from .
2 Retrieved from .
3 Retrieved from .