English Corpora

(Want to suggest a link? Found a broken link?
Please write to: projetocomet@edu.usp.br)

BNC (British National Corpus)
A large (100-million-word) corpus of modern English (1990s). The BNC World Edition is now available.

ANC (American National Corpus)
Still under construction; it is the American counterpart of the BNC.

COBUILD
Offers access to a large corpus for a fee. Also has a free demo.

ICE (International Corpus of English)
Under construction; it will include national and regional varieties of English. ICE-GB has a 20,000-word sample for download. Organized by the University of Massachusetts at Boston.

Wellington Corpus of New Zealand English
Written and spoken New Zealand English corpora for sale.

ICAME (International Computer Archive of Modern and Medieval English)
International organization of linguists and information scientists working with English machine-readable texts. The aim of the organization is to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on the material, to compile an archive of English text corpora in machine-readable form, and to make material available to research institutions.

Corpus of Written British Creole
For those interested in Caribbean Creole and its development, especially outside the Caribbean, this 12,000-word corpus of a non-standard written language variety allows researchers to see an unstandardised language developing its written form, a stage that English reached at least five centuries ago.

CELT (Corpus of Electronic Texts)
It has a searchable online database consisting of contemporary and historical Irish texts from many areas, including literature and the other arts.

Corpus of Spoken, Professional American-English
Constructed from a selection of existing transcripts of interactions in professional settings, it contains two main sub-corpora of a million words each: the first consists mainly of academic discussions, and the second contains transcripts of White House press conferences. The corpus is available commercially, and a 50,000-word sample is available online. A tagged version is also available.

COLT (The Bergen Corpus of London Teenage Language)
COLT is the first large English corpus focusing on the speech of teenagers. It was collected in 1993 and consists of the spoken language of 13- to 17-year-old teenagers from different boroughs of London. The complete corpus, half a million words, has been orthographically transcribed and word-class tagged, and is a constituent of the British National Corpus.

* For more information on English corpora, see David Lee's site: http://devoted.to/corpora