Workshop 1B - 11:45-1:30

Creating Public Corpora

Bill Kretzschmar
University of Georgia

Becky Childs
Stefan Dollinger

For many years sociolinguists have gathered field evidence and, in effect, created their own personal corpora of recordings and transcriptions. Such collections have been handled pragmatically: researchers stored recordings as it suited them, and created transcripts that met their own needs. Now, many sociolinguists are interested in the creation of public corpora, so that other researchers and the public at large can benefit from our collections. This workshop addresses key issues in the creation of public corpora, and offers examples of existing public corpora that demonstrate best practices in their creation.

1) We all have two audiences: the general public and the specialist audience.

2) General users of our collections have basic needs for locating and using resources.

3) Specialists need the same information, but with more particular requirements, including access to comprehensive data and to metadata in plain format, in order to make high-level interpretations.

I. Accessibility
Becky Childs, Coastal Carolina University

Making a web interface for the dissemination of language data to both linguists and the general public is not as simple as creating a static website that allows a user to “surf” in and download, view, or listen to recordings and transcripts. With users increasingly familiar with interfaces like YouTube for the storage and processing of streaming media, it becomes advantageous to use similar interfaces for data storage and dissemination in public venues, so that users need little to no training or instruction. The realities of these issues will be demonstrated through discussion of the Horry County Oral History and Language Project.

II. Copyright and Corpus Enhancement
Stefan Dollinger, University of British Columbia

The Bank of Canadian English (BCE) is a database created in connection with the revision of the Dictionary of Canadianisms on Historical Principles. BCE has been conceived from the outset as a project that allows for non-lexical, linguistic applications and includes features not normally found in citations databases, such as regional searches and basic concordancing functions. I will address some copyright issues when aiming to provide access to legacy print materials, and the problem of making historical and contemporary interview data available at (presently accessible by request).

III. Human Subjects and Metadata
Bill Kretzschmar, University of Georgia

For building public-access databases, the practices of the Digital Archive of Southern Speech (DASS) and Roswell Voices projects can serve as a good model for both legacy and new interview materials. DASS is a new collection of interviews sampled from the Linguistic Atlas of the Gulf States. These interviews are archival data, and so are exempt from Human Subjects restrictions, but we still protect our speakers. I will demonstrate the process for "beeping out" sensitive information from these legacy files. New interviews for Roswell Voices are subject to IRB review. Our innovation is to have speakers sign an additional contract that assigns the copyright interest in their interview to the Roswell Folk and Heritage Bureau. I will also demonstrate a software toolbox called LICHEN that offers integrated access to all data and can be used for any project.


Dollinger, Stefan. Forthcoming. “Software for the Bank of Canadian English as an open source tool for the dialectologist: and its features” – in: Manfred Markus, Clive Upton and Reinhard Heuberger (eds.) Wright's English Dialect Dictionary and Beyond. Amsterdam: Benjamins.

Dollinger, Stefan, Laurel Brinton, Margery Fee, Breanna Laing, Gerard Van Herk, Rose-Marie Dechaine, Gary Libben and Elaine Gold. 2008. Round Table Session on “The Bank of Canadian English and Linguistic Research in Canada”. Canadian Linguistic Association, 2 June, 2008, Vancouver, BC, Canada.

Kretzschmar, William A. Jr., Jean Anderson, Joan C. Beal, Karen P. Corrigan, Lisa Lena Opas-Hänninen and Bartlomiej Plichta. 2006. Collaboration on Corpora for Regional and Social Analysis. Journal of English Linguistics 34(3):172-205.