LRL Seminar: "Building a Reference Corpus of CMC"

Wednesday 26 June, 4 pm, room 220 of the MSH
Michael Beißwenger, TU Dortmund University

The talk reports on an ongoing project that aims to build a reference corpus of German computer-mediated communication (CMC) as a new component of an existing reference corpus of written contemporary German. The ‘Deutsches Referenzkorpus zur internetbasierten Kommunikation’ (DeRiK), a joint project of TU Dortmund University and the Berlin-Brandenburg Academy of Sciences and the Humanities (BBAW), will include data from the most prominent CMC genres among German Internet users and thus close a gap in the coverage of the corpus resources of the project ‘Digitales Wörterbuch der deutschen Sprache’ (DWDS), which are maintained and provided by the BBAW.

The talk will start with an overview of the DWDS balanced reference corpus of written contemporary German and then outline selected crucial issues that one faces when designing, collecting, processing and annotating a reference corpus of genres of written CMC. It will address sampling issues as well as issues related to the automatic linguistic processing (tokenization, POS tagging) of “nonstandard” phenomena in CMC data.

A special focus will be put on questions related to the annotation of CMC genres: even though there has been a lot of research on CMC in the fields of linguistics, social sciences and natural language processing, there are still no common standards for the annotation of these new forms of communication and their structural and linguistic peculiarities. Using the example of an annotation schema for CMC that was created for use in DeRiK, the talk will discuss the advantages and challenges of modelling CMC genres on the basis of the encoding framework provided by the Text Encoding Initiative (TEI). This will include a brief overview of the ideas and structure of the TEI framework in general. As an outlook, the talk will outline the vision of developing a general schema for the representation of CMC genres as a suggestion for a new component of a future version of the TEI standard. Such a standard would allow for an interchange of CMC data between research groups in the field of Digital Humanities and for building interoperable CMC corpora for different languages.

