Corpus Linguistics

Updated 7 July, 2022

Posted 21 May, 2018

What is corpus linguistics?

Corpus Linguistics (CL) can be defined as “a branch of linguistics that bases its research on data obtained from corpora” (Martín Peris et al., 2008; translated into English from Spanish). It is not a linguistic discipline in its own right but a methodological approach that can be applied across many disciplines. Because of this interdisciplinary reach, it is becoming increasingly important in current linguistic studies.

CL is based on the compilation and analysis of linguistic corpora. A corpus is a collection of real text samples of a language (taken from novels, plays, scripts, news, essays, transcripts of radio and television programs, conversations or even speeches). Corpora are usually available online or in electronic format, since they are often large. Selected according to previously determined objective criteria, the samples are meant to provide a representation of the language under study. In this sense, “representativeness is thus the cornerstone of CL, since it determines whether reliable conclusions can be drawn from the statistical data” (Cruz Piñol, 2012: 36; translated into English from Spanish). In sum, corpora are an important tool in linguistics because they allow research on different aspects of a language or a variety thereof.

The criteria used to choose texts determine the representativeness of a corpus. The following classification is often made (EAGLES, 1996):

General corpora: they provide a representative and thorough sample of the different varieties, structures, and vocabulary of a language, which is why they are large in size; this type of corpus allows the study of the properties of a language at different stages of development.
Specialized corpora: despite the controversy around their definition, it is often agreed that they pursue a more specific purpose and are smaller in size than general corpora; they aim to be representative of a sublanguage or of the use of a language by a group of speakers with a common linguistic behavior.
Bilingual or multilingual corpora: two types can be found within them,
- parallel corpora: they include original texts and their translations into one or several languages.
- comparable corpora: they include similar texts in several languages, which allows comparison between them.
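To make concrete the kind of statistical data a corpus yields, here is a minimal Python sketch, not tied to any of the corpora listed in this guide. It counts word frequencies in a tiny invented corpus and expresses a count per million words, a common way of comparing corpora of different sizes. The sample sentences, the naive tokenizer, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy corpus: three invented text samples, for illustration only.
SAMPLES = [
    "The corpus is a collection of real texts.",
    "A corpus lets researchers count how often a word is used.",
    "Real texts show how a language is really used.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately naive)."""
    return re.findall(r"\w+", text.lower())


def frequency_table(samples):
    """Count how many times each token occurs across all samples."""
    counts = Counter()
    for sample in samples:
        counts.update(tokenize(sample))
    return counts


def per_million(count, total_tokens):
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / total_tokens


counts = frequency_table(SAMPLES)
total_tokens = sum(counts.values())

# Show the five most frequent tokens with raw and normalized frequencies.
for word, count in counts.most_common(5):
    print(word, count, round(per_million(count, total_tokens)))
```

With only a few dozen tokens the normalized figures are meaningless, which is precisely why representativeness and size matter: the same calculation only supports reliable conclusions on a carefully sampled corpus.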

Corpora in Spanish

Corpus de Referencia del Español Actual (CREA)

It is a general corpus developed by the Instituto de Lexicografía of the Real Academia Española. It includes oral and written texts produced in Spanish-speaking countries from 1975 to 2004. The written samples come from magazines, newspapers and books from different fields: literary, journalistic, scientific, and technical. The spoken samples (only 10%) are taken from radio and television transcripts. A recently added annotated version allows searches by form, lemma and grammatical category.

Corpus Diacrónico del Español (CORDE)

It is a written corpus also developed by the Instituto de Lexicografía of the Real Academia Española. It includes texts from all historical periods and from every place where Spanish was spoken, from the origins of the language to 1974. It has 250 million samples from different genres.

Corpus del Español del Siglo XXI (CORPES XXI)

Promoted by the Real Academia Española and the Asociación de Academias de la Lengua Española (ASALE), this general corpus is still under construction. It includes written and spoken texts produced by Spanish speakers between 2001 and 2012. It is intended as a continuation of CREA and CORDE. Texts have been drawn from books, print and online press, and audiovisual media.

Mark Davies’ Spanish Corpus:

It is a general corpus funded by the National Endowment for the Humanities in the United States. It belongs to the collection of BYU corpora and is divided into two parts:

Genre/Historical: from the 18th to the 20th century; it contains oral, fictional, journalistic and academic texts (100 million words).
Web/Dialects: it includes more recent texts (three or four years old) extracted from webpages from 21 Spanish-speaking countries (2 billion words).

In addition, it should be noted that its interface permits the creation of “virtual corpora” on the basis of a selection of authors, sources and topics.

Corpus Oral y Sonoro del Español Rural (COSER)

It is an ongoing oral corpus whose samples have been collected since 1990. The sampled speakers are 70.7 years old on average and have minimal schooling. The corpus thus allows research into the dialectal variation of spoken Spanish in rural areas of the Iberian Peninsula. Queries return the audio track of each interview and its corresponding transcript.

Corpus Oral de Referencia de la Lengua Española Contemporánea (CORLEC):

Promoted by the Universidad Autónoma de Madrid, this general corpus can be downloaded for free from the university’s webpage. It allows research into Spanish as spoken between 1991 and 1992. Texts come from a wide range of genres (administrative, scientific, legal, media…), but conversational texts predominate.

Corpus Val.Es.Co 2.0:

This corpus of informal Spanish includes 46 conversations and was created by a research group from the Spanish department at Universidad de Valencia. The interface gives access to conversations through searches by topic, intervention or intonation group, so queries can be narrowed very precisely.

Proyecto para el Estudio Sociolingüístico del Español de España y de América (PRESEEA):

PRESEEA is a project started by the Asociación de Lingüística y Filología de la América Latina (ALFAL). It aims to create a corpus of spoken Spanish that is representative of the Hispanic world in all its geographical and social variety. Work on the corpus began in 1996 and is still ongoing.

Corpora in French

Traitement de Corpus Oraux en Français (TCOF):

It is an oral corpus whose material was first collected in the 1980s and 1990s; collection resumed in 2000. The corpus includes two main categories: interactions between adults and children (up to 7 years old) and interactions among adults. The interface allows results to be filtered by criteria such as context (public, private, professional or academic), channel (face to face, on the phone, videoconference, television or radio), the relationship among the participants, sex, knowledge of the French language, etc.

FRANTEXT

It is a textual corpus developed by Analyse et Traitement Informatique de la Langue Française and the Centre National de la Recherche Scientifique. The texts, dating from 1180 to 2009, are literary, philosophical, and scientific and technical (the last group accounts for only 10%).

Corpus de la Langue Parlée en Interaction (CLAPI):

It is an oral corpus of French speakers from France, Switzerland and Germany. Samples were collected between 1984 and 2008. They are conversations in different contexts between native or non-native speakers of different ages.

Enquêtes Sociolinguistiques à Orléans (ESLO):

It is an oral corpus that includes transcripts collected in Orléans. It has two parts: the first (ESLO1) was collected between 1968 and 1974, and the second (ESLO2) has been compiled since 2008.

Corpus de Français Parlé Parisien des années 2000 (CFPP 2000):

It is an ongoing oral corpus of interviews with adults from the suburbs and outskirts of Paris. It allows research into “common” spoken Parisian French. It contained 500,000 words as of 2011, and it aims to reach one million words.

Corpora in English

British National Corpus (BNC):

(choose the option “BNCWeb at Lancaster University” and sign up for free)

The British National Corpus is an oral and written corpus containing one hundred million words of British English from the late 20th century. The written part (90%) includes newspapers and journals for all kinds of readers, academic and fictional books, published and unpublished letters, and university and school essays. The spoken part (10%) includes transcripts of informal conversations and of spoken language in different contexts, from business and government meetings to television and radio programs.

Corpus of Contemporary American English (COCA):

(Sign-up is required, and the number of free queries per day is limited)

It is the largest freely available corpus of American English. It has more than 250 million words and includes spoken texts, fiction, popular magazines, newspapers and academic texts.

BROWN Corpus:

(Download here)

BROWN was the first general corpus available in electronic format. It was started at Brown University in the 1960s (samples were collected between 1963 and 1964) and contains one million words of American English. This corpus has been used as the basis for others such as the Lancaster-Oslo/Bergen Corpus (LOB) of British English, the Freiburg-Brown Corpus of American English (FROWN), and the Freiburg-LOB Corpus of British English (FLOB).

News on the Web Corpus (NOW):

This corpus contains 5.2 billion words extracted from online newspapers and journals from 2010 to the present. The interface allows users to create personalized collections of texts based on registers or webpages.

Multilingual, specialized, and parallel corpora

Hansard Corpus:

The Hansard Corpus contains 1.6 billion words of speeches from the British Parliament from 1803 to 2005. It was created as part of the SAMUELS project (Semantic Annotation and Mark-Up for Enhancing Lexical Searches), carried out between 2014 and 2016. Queries can be restricted and compared by decade, party and house.

Corpus paralelo ACTRES:

This corpus includes English texts and their translations into Spanish, as well as Spanish texts and their translations into English. It contains more than 4 million words across both languages. Texts have been extracted from fiction and nonfiction books, newspaper and journal articles, and other types of texts.

Fono.ELE:

This corpus contains samples from learners of Spanish of six nationalities (German, Greek, Taiwanese, Polish, Portuguese, and Egyptian) with levels from A2 to C1. It is very useful because it allows research into students’ acquisition of Spanish, especially at the phonic level.

CHILDES:

This corpus centers on child language in Spanish, French, German, Japanese, English, etc. It focuses on bilingual children and children with language impairment. Each oral text sample comes with its transcript, and both files can be downloaded. In addition, the syntactic and morphological tagging of the texts is outstandingly complete. CHILDES is part of a larger corpus, TalkBank, whose aim is to support research into human communication in general.

TEXTUAL CORPUS FOR RESEARCH AND DEVELOPMENT:

-ELRA-M0001 Basic multilingual lexicon (MEMODATA).

-ELRA-L0042 PAROLE Spanish Lexicon.

-ELRA-W0017 MULTEXT JOC Corpus.

-ELRA-W0053 Catalan-Spanish Parallel Corpus.

-ELRA-S0149 Spanish Speech Corpus 1.

-ELRA-M0015 EuroWordNet English Addition to English WordNet.

-ELRA-M0017 EuroWordNet Spanish.

USEFUL LINKS

International Journal of Learner Corpus Research

International Journal of Corpus Linguistics

Linguist List – Text & Corpora