- A guide to the dialects and words of Scotland’s regions - The Scotsman
The main challenge associated with corpus-based dialectology is sampling natural language in sufficient quantities from across a region of interest to permit meaningful analyses. Corpus-based dialectology has only become feasible with the rise of computer-mediated communication, which deposits massive amounts of regionalized language data online every day.
Aside from early studies based on corpora of letters to the editor downloaded from newspaper websites e. Research on regional lexical variation on American Twitter has been especially active e. For example, Huang et al. Twitter has also been used to study more specific varieties of American English. For example, Jones analyzed regional variation in African American Twitter, finding that African American dialect regions reflect the pathways taken by African Americans as they migrated north during the Great Migration.
There has been considerably less Twitter-based dialectology for British English. Most notably, Bailey compiled a corpus of UK Twitter and mapped a selection of lexical and phonetic variables, while Shoemark et al. In addition, Durham used a corpus of Welsh English Twitter to examine attitudes toward accents in Wales, and Willis et al.
Research in corpus-based dialectology has grown dramatically in recent years, but there are still a number of basic questions that have yet to be fully addressed. Perhaps the most important of these is whether the maps of individual features generated through the analysis of Twitter corpora correspond to the maps generated through the analysis of traditional survey data. Some studies have begun to investigate this issue.
For example, Cook et al. Similarly, Bailey found a general alignment for a selection of features for British English. While these studies have shown that Twitter maps can align with traditional dialect maps, the comparisons have been limited, being based on some combination of a small number of hand-selected forms, restricted comparison data e. A feature-by-feature comparison of Twitter maps and survey maps is needed because it is unclear to what extent Twitter maps reflect general patterns of regional linguistic variation.
The careful analysis of a large and representative Twitter corpus is sufficient to map regional patterns on Twitter, but it is also important to know whether such maps generalize beyond this variety, as this would license the use of Twitter data for general investigations of regional linguistic variation and change, as well as for a wide range of applications.
The primary goal of this study is therefore to compare lexical dialect maps based on Twitter corpora and survey data so as to assess the degree to which these two approaches to data collection yield comparable results. We do not assume that the results of surveys generalize; rather, we believe that alignment between these two very different sources of dialect data would be strong evidence that both approaches to data collection allow for more general patterns of regional dialect variation to be mapped.
A secondary goal of this study is to test how consistent dialect patterns are across different communicative contexts. Corpus-based dialectology has shown that regional variation pervades language, even in the written standard (Grieve), but we do not know how stable regional variation is on the level of individual linguistic features. To address these gaps in our understanding of regional linguistic variation, this paper presents the first systematic comparison of lexical dialect maps based on surveys and Twitter corpora.
Specifically, we report the results of a spatial comparison of the maps for lexical variants based on a multi-billion-word corpus of geocoded British Twitter data and the BBC Voices dialect survey. Interest in regional dialect variation in Great Britain is longstanding, with the earliest recorded comments on accent dating back to the fifteenth and sixteenth centuries (Trevisa). The study of regional variation in lexis grew in popularity during the late eighteenth and early nineteenth centuries, with dialect glossaries being compiled across the country, especially in Yorkshire and the North, in order to preserve local lexis, which was assumed to be going extinct.
Most notably, Wright's English Dialect Dictionary, which drew on many of these glossaries, detailed lexical variation across the British Isles, especially England. The earliest systematic studies of accents in England also began around this time (see Maguire). Data was collected between and in primarily rural locations using a 1, question survey, which included lexical questions. Respondents, typically older males who had lived most of their lives in that location, were interviewed face-to-face by a fieldworker.
The rest of the UK was covered separately. Scotland and Northern Ireland, along with the far north of England, were mapped by The Linguistic Survey of Scotland , which began collecting data in through a postal questionnaire Mather et al. This survey also mapped regional variation in Scottish Gaelic O'Dochartaigh, Finally, both Welsh Jones et al.
With the rise of sociolinguistics in the s and s, work on language variation and change in the UK shifted focus from regional patterns to social patterns, generally based on interviews with informants from a range of social backgrounds from a single location. Interest in regional dialects, however, began to re-emerge recently. A national survey was never conducted, but the SuRE method was adopted for research in individual locations, including by Llamas in Middlesbrough, Asprey in the Black Country, and Burbano-Elizondo in Sunderland. BBC Voices was designed to provide a snapshot of modern language use in the UK and employed various methods for data collection, including group interviews Robinson et al.
This lexical data, discussed below, is the basis for the present study.
It has previously been subjected to statistical analysis Wieling et al. In , Bert Vaux initiated the Cambridge online survey of World Englishes, which collects data on 31 alternations of various types from across the world, including the UK. MacKenzie et al.
Finally Leemann et al. There is also a long history of corpus-based research in British dialectology. Most research on Old and Middle British dialects is essentially corpus-based, as it relies on samples of historical writing e. Informants were recorded in their home and encouraged to talk about any subject they pleased to elicit naturalistic speech. The second was the 2. Because these datasets consist of transcriptions of interviews elicited from a small number of informants, they fall in between traditional dialect surveys and the large natural language corpora that are the focus of this study.
Despite this long tradition of research, relatively little is known about regional linguistic variation in contemporary British English, especially compared to American English and especially in regard to lexical and grammatical variation. In large part this is because so few researchers have yet taken advantage of the immense social media corpora that can now be compiled and whose popularity is driving dialectology around the world. In addition to comparing lexical variation in corpora and surveys, a secondary goal of this study is therefore to encourage the adoption of computational approaches in British dialectology.
The regional dialect survey data we used for this study was drawn from the BBC Voices project (Upton). We chose this dataset, which was collected online between and , not only because it is easily accessible, but because it is the most recent lexical dialect survey of British English and because it focuses on everyday concepts, whereas older surveys tended to focus on archaic words and rural concepts, which are rarely discussed on Twitter. The criteria for the selection of these 38 questions are unclear. Some e. In addition, two questions (male partner, female partner) are associated with variants that are not generally interchangeable e.
Not all informants responded to all questions.
The most responses were provided for drunk 29, and the fewest for to play a game 9, Across all responses, 1, variants were provided, with the most for drunk and the fewest for mother The large number of variants associated with each alternation is problematic because if we considered the complete set, our comparison would be dominated by very uncommon forms, which cannot be mapped accurately.
Consequently, we only considered the most common variants of each alternation. In doing so, however, we violated the principle of accountability, which requires all variants to be taken into consideration (Labov). Fortunately, this frequency distribution ensures that excluding less common variants, which contribute so few tokens, will have almost no effect on the proportions of the more common variants.
We tested other cut-offs, but higher thresholds e. Not only is each alternation associated with multiple variants, but each variant is associated with multiple distinct orthographic forms. These are the specific answers provided by informants that were judged by the BBC Voices team to be closely related to that variant, including inflections, non-standard spellings, and multiword units. Across all responses, 45, distinct forms were provided (ignoring capitalization), with the most for unattractive 2, and the fewest for a long seat The large number of forms associated with each variant is also problematic, especially because many of the most uncommon forms are of unclear status.
Fortunately, the frequency distribution also allowed us to exclude less frequent forms from our analysis without affecting the regional patterns of more frequent variants. For each variant we only included forms that were returned by at least 50 informants. At the end of this process, our final feature set includes 36 alternations e. The complete set of alternations and variants is presented in Table 1. The complete set of forms is included in the Supplementary Materials. The number of variants per alternation ranges from 2 to 7, most with 4 variants; the number of forms per variant ranges from 1 to 12, most with 2 forms.
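The form-filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(informant_id, variant, form)` tuple layout and the function name `frequent_forms` are hypothetical and do not reflect the actual structure of the BBC Voices data; only the threshold of 50 informants and the decision to ignore capitalization come from the text.

```python
from collections import Counter

MIN_INFORMANTS = 50  # threshold stated in the text


def frequent_forms(responses, min_informants=MIN_INFORMANTS):
    """Keep, for each variant, the forms supplied by enough informants.

    `responses` is an iterable of (informant_id, variant, form) tuples.
    This layout is a hypothetical stand-in for the survey records.
    """
    seen = set()        # (informant, variant, form) triples already counted
    counts = Counter()  # (variant, form) -> number of distinct informants
    for informant_id, variant, form in responses:
        form = form.lower()  # ignore capitalization when comparing forms
        key = (informant_id, variant, form)
        if key not in seen:
            seen.add(key)
            counts[(variant, form)] += 1

    kept = {}
    for (variant, form), n in counts.items():
        if n >= min_informants:
            kept.setdefault(variant, set()).add(form)
    return kept
```

Counting distinct informants rather than raw responses means that a single informant who repeats a form cannot push it over the threshold.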
This situation is problematic and points to a larger issue with polysemy and homophony in our feature set, which we return to later in this paper. Crucially, however, because the proportional use of each variant is calculated relative to the frequency of the other variants of that alternation, the maps for these overlapping variants are distinct.
After selecting these variants, we extracted the regional data for each from the BBC Voices dataset, which provides the percentage of informants in UK postal code areas who supplied each variant. Notably, these two extreme postal code areas have the fewest respondents, leading to generally less reliable measurements for these areas. Most areas, however, are associated with far more informants and thus exhibit much more variability.
There are also a very small number of missing data points in our BBC Voices dataset 48 out of 17, values , which occur in cases where no responses were provided by any informants in that postal code area for that question. Because this is a negligible amount of missing data and because it is distributed across many variants, we simply assigned the mean value for that variant across all locations to those locations. In addition, because the BBC Voices dataset provides percentages calculated based on the complete set of variants, whereas we are looking at only the most common variants, we recalculated the percentage for each variant in each postal code area based only on the variants selected for analysis.
For example, in the Birmingham area, the overall percentages for cack-handed Finally, we mapped each of the variants in this dataset. In this case, a clear regional pattern can be seen within and across variants, with sofa being relatively more common in the South, couch in Scotland, and settee in the Midlands and the North of England.
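The two preprocessing steps described above (filling missing values with the mean for that variant across locations, then recalculating percentages over only the selected variants) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of `None` for a missing value, and the numbers in the usage example are all assumptions made for the sake of the example.

```python
def impute_with_mean(values):
    """Replace missing values (None) with the mean of the observed values.

    `values` holds one variant's percentages, one entry per postal code area.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def renormalize(area_percentages, selected):
    """Rescale the selected variants so they sum to 100 within one area.

    `area_percentages` maps variant -> percentage over the complete set of
    variants; `selected` lists the variants retained for analysis.
    """
    total = sum(area_percentages[v] for v in selected)
    if total == 0:
        # No informant in this area supplied any selected variant.
        return {v: 0.0 for v in selected}
    return {v: 100.0 * area_percentages[v] / total for v in selected}


# Hypothetical percentages for one area; "other" stands for excluded variants.
area = {"sofa": 40.0, "settee": 30.0, "couch": 10.0, "other": 20.0}
rescaled = renormalize(area, ["sofa", "settee", "couch"])
```

In this made-up example the three retained variants account for 80% of responses, so each percentage is divided by 0.8, which leaves the ratios between the retained variants unchanged.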