Workshop 1A - 11:35-1:30

Automated data extraction techniques for sociophonetic corpus research

Keelan Evanini and William Labov
University of Pennsylvania

The standard practice in sociophonetic research is to extract acoustic measurements manually from the audio file with the help of a speech analysis program such as Praat. It has long been maintained that manual annotation is necessary in order to collect valid and reliable measurements. However, because manual analysis is so time-consuming, sociolinguistic corpora analyzed in this manner must be relatively small. Analyses involving larger corpora take several years to complete (e.g. Labov, Ash, and Boberg 2006). In addition to being labor-intensive, manual analysis of one particular acoustic feature in a corpus does not prepare the corpus for future research on other acoustic phenomena. For these reasons, the development and refinement of automated methods of sociophonetic analysis is necessary for producing large-scale studies with reproducible results.

In the first part of this workshop, we will discuss reasons why manual measurements have been considered necessary in the past. Focusing on the specific case of vowel formant measurement, we will present examples where corrections by human annotators are necessary to improve upon automatically produced measurements, and we will discuss why such issues have led to an enduring preference for manual measurements in sociophonetic research.

In the second part of the workshop, we will present the technique of Forced Alignment (FA), which takes as input an audio file and its transcription and outputs time stamps at both the word level and the phoneme level. These time stamps allow acoustic information to be extracted for any individual token or for all tokens in a phonemic class. FA also creates a corpus that can be searched quickly in case a manual analysis is still desired. We will argue that the application of FA to sociophonetic research will lead to a great increase in the amount of data and the types of phenomena that can be analyzed.
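To make the shape of FA output concrete, the following is a minimal sketch of how word-level and phoneme-level time stamps might be stored and queried. Everything in it is an assumption for illustration: the `Interval` class, the helper names, the ARPAbet-style labels with stress digits, and the time values are hypothetical and are not taken from the Penn system or from any real recording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One aligned unit: a word or a phoneme with start/end times in seconds."""
    label: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


# Hypothetical alignment of the utterance "the cat" (invented time stamps).
words = [Interval("the", 0.00, 0.10), Interval("cat", 0.10, 0.52)]
phones = [
    Interval("DH", 0.00, 0.04),
    Interval("AH0", 0.04, 0.10),
    Interval("K", 0.10, 0.18),
    Interval("AE1", 0.18, 0.40),
    Interval("T", 0.40, 0.52),
]


def tokens_of(phone_tier, phoneme):
    """Return all tokens of one phonemic class, ignoring trailing stress digits."""
    return [p for p in phone_tier if p.label.rstrip("012") == phoneme]


def word_containing(word_tier, phone):
    """Return the word interval that contains the midpoint of a phone token."""
    midpoint = (phone.start + phone.end) / 2
    for w in word_tier:
        if w.start <= midpoint < w.end:
            return w
    return None


# Extract every /ae/ token together with its duration and host word.
ae_tokens = tokens_of(phones, "AE")
for tok in ae_tokens:
    host = word_containing(words, tok)
    print(f"{tok.label} in '{host.label}': {tok.duration:.2f} s")
```

Because every token carries its own time stamps, the same two tiers can be reused later to measure a different acoustic feature without re-annotating the audio.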

The FA portion of the workshop will contain the following sections:

  • a brief overview of the mathematical underpinnings of FA, focusing on helping linguists understand the types of errors that might arise
  • an introduction to the FA system currently in use at Penn and a step-by-step tutorial on how to use the software
  • a presentation of best practices for sociolinguistic data collection and transcription that will help produce more accurate FA
  • a case study on vowel duration using the Atlas of North American English in which over 100,000 measurements were extracted using FA
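
As a rough illustration of the first item above, the core of forced alignment can be sketched as a dynamic-programming search for the best monotonic assignment of audio frames to the phones of the transcription. Forced aligners are commonly built on hidden Markov models decoded with a Viterbi-style search; the toy version below is only a sketch under strong assumptions and is not the Penn system. The function name `force_align`, the score matrix, and its values are hypothetical, and transition probabilities and acoustic modelling are omitted entirely.

```python
import math


def force_align(scores):
    """Align frames to phones with a simplified Viterbi-style search.

    scores[t][j] is an assumed log-likelihood that frame t belongs to the
    j-th phone of the transcription. Phones must occur in order and each
    must cover at least one frame. Returns (start_frame, end_frame) pairs,
    end exclusive, one per phone.
    """
    n_frames, n_phones = len(scores), len(scores[0])
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phone")

    best = [[-math.inf] * n_phones for _ in range(n_frames)]
    advanced = [[False] * n_phones for _ in range(n_frames)]
    best[0][0] = scores[0][0]  # the path must start in the first phone

    for t in range(1, n_frames):
        for j in range(n_phones):
            stay = best[t - 1][j]                          # remain in phone j
            move = best[t - 1][j - 1] if j > 0 else -math.inf  # enter phone j
            if move > stay:
                best[t][j] = move + scores[t][j]
                advanced[t][j] = True
            else:
                best[t][j] = stay + scores[t][j]

    # Trace back from the last frame of the last phone to recover boundaries.
    starts, ends = [0] * n_phones, [0] * n_phones
    j = n_phones - 1
    ends[j] = n_frames
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][j]:
            starts[j] = t
            j -= 1
            ends[j] = t
    return list(zip(starts, ends))


# Invented scores: 0.0 where a frame matches a phone well, -5.0 otherwise.
GOOD, BAD = 0.0, -5.0
toy_scores = [
    [GOOD, BAD, BAD],
    [GOOD, BAD, BAD],
    [BAD, GOOD, BAD],
    [BAD, GOOD, BAD],
    [BAD, GOOD, BAD],
    [BAD, BAD, GOOD],
]
segments = force_align(toy_scores)
print(segments)  # [(0, 2), (2, 5), (5, 6)]
```

One property of this sketch hints at a type of error worth watching for: every phone in the transcription is forced to occupy at least one frame, so a phone that the speaker never produced would still receive boundaries.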