Fostering linguistic studies on Wikipedia discussions. Multilingual corpus building, annotation and exploration tools
CALL FOR PARTICIPATION
French-German Colloquium WikiCorp 2018
Fostering linguistic studies on Wikipedia discussions.
Multilingual corpus building, annotation and exploration tools
Two-day colloquium at Université Nice Côte d’Azur (FR), July 9-10, 2018
Invited speakers
David Laniado, Eurecat Barcelona
Torsten Zesch, Universität Duisburg-Essen
WikiCorp 2018 website: http://www1.ids-mannheim.de/kl/nice2018
Registration
If you would like to participate, please fill in the registration form
at http://www1.ids-mannheim.de/fileadmin/kl/nizza/registration_form..pdf
and send it by email to Celine.Poudat@unice.fr by 2 July 2018.
Organisers
Céline Poudat (Université Nice Côte d’Azur)
Angelika Storrer (Universität Mannheim)
Harald Lüngen (Institut für Deutsche Sprache, Mannheim)
Laura Herzberg (Universität Mannheim)
Local organisation: Céline Poudat, Daria Cialon, Magali Guaresi and the
BCL team in Nice.
Funding: Huma-Num CORLI consortium
Confirmed Participants (last updated 2018-05-02)
Natalia Grabar, STL, Université Lille 3
Laura Herzberg, Universität Mannheim
Mai Ho-Dac, CLLE-ERSS, Université Toulouse
Marc Kupietz, Institut für Deutsche Sprache, Mannheim
David Laniado, Eurecat, Barcelona
Harald Lüngen, Institut für Deutsche Sprache, Mannheim
Christophe Parisse, Head of Ortolang, MoDyCO, Université Paris X-Nanterre
Céline Poudat, BCL, Université Côte d’Azur
Angelika Storrer, Universität Mannheim
Serena Villata, Wimmics, Université Côte d’Azur
Torsten Zesch, Universität Duisburg-Essen
Location: Campus Saint-Jean-d’Angely 3, MSHS building, Salle Plate.
PRELIMINARY SCHEDULE (last updated 2018-05-29)
Monday, 9 July 2018
9:30-10:00 Opening
10:00-12:00 Section I: Joint corpus building, standards, and tools
- Short presentations on Features of the French and
German Wikipedia corpora, From Wiki dump and wiki text
to TEI, EuReCo idea and state of affairs
- Discussion and documentation of desiderata and requirements
12:00-13:30 Lunch
13:30-16:00 Section II: Linguistic analyses of social interaction and conflicts
Invited talk: Torsten Zesch: Annotating, Detecting, and
Understanding Stance in Computer-Mediated Debates
- Short presentations on annotation categories for
linguistic analysis of interaction patterns, conflict
analysis, conflict detection
- Discussion and documentation of desiderata and
requirements.
16:00-16:30 Coffee
16:30-17:30 Two parallel Breakout Sessions: ad Section I (a) and ad Section II
20:00 Dinner
Tuesday, 10 July 2018
9:30-10:00 Documentation of the Results of the two Breakout Sessions from Day 1
10:00-12:30 Section III: Corpus analysis methods
Invited talk: David Laniado:
Visualisation of Wikipedia Interactions (working title)
- Short presentations on Exploring French and German
Wikipedia discussion corpora using Hyperbase/Textométrie
and KorAP, Visualisation of word usage histories using
word embeddings
- Discussion and documentation of desiderata and
requirements
12:30-14:00 Lunch
14:00-15:00 Two parallel Breakout Sessions: ad Section III and ad Section I (b)
15:00-15:30 Documentation of the results of the two parallel Breakout
Sessions
15:30-16:00 Coffee
16:00-17:30 - Planning the post-conference publication.
- Planning the implementation of results, follow-up
activities, projects, and further co-operation
- Wrap-up of the colloquium
Background
Wikipedia is one of the most successful projects of the Web 2.0. Since
its launch in 2001, thousands of contributors have built this huge
knowledge resource, which is not only used as an online encyclopedia,
but also as an object of research in many academic disciplines. It
also constitutes a rich and unique resource for linguistic studies,
first of all because of its multilinguality, and secondly because of
its huge discussion spaces, in which the collaborative writing effort
is negotiated. These so-called talk pages can be used as big corpus
resources of Computer-Mediated Communication (CMC).
The French and German participants of the colloquium are part of an
initiative which aims to foster linguistic studies on Wikipedia,
providing recommendations for the building of Wikipedia standardized
corpora, methods for their linguistic processing and exploration, and
descriptors and annotations for the analysis of talk pages. The
French-German team of proposers started co-operating in 2016 with a
first workshop in Mannheim entitled “Wikipedia: Discourse and corpus
linguistic perspectives”. Since then, the proposers and other
participants have co-operated in various constellations at
conferences and on joint publications and proposals. The group is now
ready to prepare the ground for jointly building comparable
French-German corpora to be used in cross-lingual, corpus-based
analyses of Wikipedia discussions.
To date, most linguistic studies on Wikipedia have focused on the
article pages and do not analyse in depth the linguistic
features used in the discussion spaces. This may be due to three
reasons: (i) Wikipedia is quite a complex object that linguists find
difficult to handle; (ii) Wikipedia interactions need specific
descriptors and ad hoc annotations for analysis; and (iii) existing
corpus technologies and exploration tools need to be adjusted to the
specificities of CMC corpora in general and Wikipedia corpora in
particular. More sophisticated tools and methods for linguistic
annotation and corpus exploration are needed to better exploit the
huge and valuable corpus resources that can be constructed from
Wikipedia discussions.
The colloquium will bring together researchers who have solid
experience in preparing monolingual (French and German) corpora from
Wikipedia, in disseminating them and providing corpus technology
for their analysis, and in conducting linguistic research on social
interaction in Wikipedia discussions, with a particular interest in the
analysis and detection of conflicts.
Goals of the colloquium
The colloquium is committed to the long-term goal of building
comparable French-German discussion corpora as a special type of big
CMC corpora using TEI-compliant standards. These shall serve as a
basis to further develop common tools and methods for the
cross-lingual, corpus-based analyses of interaction, politeness and
conflict.
Objectives concerning corpus building, standards, and tools:
Harmonize the parameters of the so far separate French and German
Wikipedia corpus building processes in order to make them
interoperable for German-French contrastive and cross-lingual analyses:
further develop the standards of the TEI CMC SIG; align metadata
categories and value taxonomies.
Objectives concerning interaction analyses:
Develop annotation categories for interaction patterns, politeness
cues, and conflict analysis, as well as a joint representation of
conflict structures.
Objectives concerning corpus analysis methods:
Develop and adapt corpus-linguistic methods from KorAP and Textométrie
to explore and visualize cross-lingual analyses on Wikipedia
discussion corpora; prepare the exploration of cross-linguistic
distributional semantics by training word embedding models on the
French and German Wikipedias.