Language corpora as a source of information for
Biber 1995; Biber & Conrad 2009
Establishing MD model:
Mini-portal https://www.korpus.cz/mda
Cvrček, V. – Komrsková, Z. – Lukeš, D. – Poukarová, P. – Řehořková, A. – Zasina, A. J. (2021): From extra- to intratextual characteristics: Charting the space of variation in Czech through MDA. Corpus Linguistics and Linguistic Theory 17(2), p. 351–382.
Cvrček, V. – Komrsková, Z. – Lukeš, D. – Poukarová, P. – Řehořková, A. – Zasina, A. J. (2018): Variabilita češtiny: multidimenzionální analýza. Slovo a slovesnost 79, p. 293–321.
Cvrček, V. – Laubeová, Z. – Lukeš, D. – Poukarová, P. – Řehořková, A. – Zasina, A. J. – Benko, V. (2020): Comparing web-crawled and traditional corpora. Language Resources & Evaluation 54, p. 713–745.
Cvrček, V. – Laubeová, Z. – Lukeš, D. – Poukarová, P. – Řehořková, A. – Zasina, A. J. (2020): Registry v češtině. Praha: Nakladatelství Lidové noviny, 233 p.
wri, spo, web
Category | Count |
---|---|
Tokens | 10.8 M |
Words (excl. punct.) | 9 M |
Lemmata (types) | 204 K |
Text chunks | 3,334 |
Originally 140+ features, final list 122, e.g.:
Positive loading features:
Negative loading features:
Positive loading features:
Negative loading features:
Additive MDA (cf. Berber Sardinha et al. 2019, 165):
\(\Rightarrow\) register information for any text of a language
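The idea of placing a new text on already established dimensions can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the cited work: the score is taken here as a loading-weighted sum of z-standardized feature rates, and the feature names, loadings, and reference statistics are all hypothetical.

```python
def dimension_score(feature_rates, ref_mean, ref_sd, loadings):
    """Score one text on one existing dimension.

    feature_rates: normalized feature frequencies of the new text
    ref_mean, ref_sd: per-feature mean and standard deviation in the
        reference corpus the dimensions were derived from
    loadings: per-feature loadings on the dimension
    All inputs are dicts keyed by feature name (hypothetical names below).
    """
    score = 0.0
    for feature, loading in loadings.items():
        # Standardize against the reference corpus, not the new text itself.
        z = (feature_rates[feature] - ref_mean[feature]) / ref_sd[feature]
        score += loading * z
    return score


# Hypothetical values, for illustration only.
loadings = {"past_tense": 0.8, "nouns": -0.6}
ref_mean = {"past_tense": 50.0, "nouns": 250.0}
ref_sd = {"past_tense": 10.0, "nouns": 25.0}
new_text = {"past_tense": 70.0, "nouns": 200.0}

score = dimension_score(new_text, ref_mean, ref_sd, loadings)
print(round(score, 2))
```

A text whose feature rates equal the reference means scores 0 on every dimension, so the sign and size of the score show how far the new text departs from the reference corpus average.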
Register
Static registers
Dynamic registers
Average position of texts within the narrative cluster (9)
Usage based approach to style labels:
\(\Rightarrow\) using frequency and association measures
Instead of co-occurrence of words x and y we work with word x and its presence in texts of register R:
\[\text{MI-score} = \log_2 \frac{f(xy) \times N}{f(x)\times f(y)} \Rightarrow \log_2 \frac{f(xR) \times N}{f(x)\times f(R)}\]
\[\text{logDice} = 14 + \log_2 \frac{2f(xy)}{f(x) + f(y)} \Rightarrow 14 + \log_2 \frac{2f(xR)}{f(x) + f(R)}\]
where:
lemma | Register | logDice |
---|---|---|
a ‘and’ | analysis | 9.35 |
v ‘in’ | analysis | 9.24 |
být ‘be’ | analysis | 8.70 |
ten ‘this’ | question answering | 10.32 |
být ‘be’ | question answering | 9.50 |
že ‘that’ | question answering | 9.38 |
a ‘and’ | argumentation | 9.57 |
být ‘be’ | argumentation | 9.33 |
se ‘refl. pron.’ | argumentation | 9.23 |
logDice: too common/indefinite
lemma | Register | MI |
---|---|---|
kontrolér ‘inspector’ | analysis | 3.52 |
honitba ‘hunt’ | analysis | 3.52 |
skartační ‘shredding’ | analysis | 3.51 |
eeh ‘filler’ | question answering | 4.75 |
nó ‘well’ | question answering | 4.73 |
něak ‘somehow’ | question answering | 4.69 |
svěřenský ‘trust’ | argumentation | 2.81 |
zatáčení ‘turning’ | argumentation | 2.81 |
přitažení ‘dragging’ | argumentation | 2.80 |
MI-score: too specialized
source: Koditex
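The contrast between the two frequency-based measures can be reproduced with a small sketch. It assumes logDice in its usual form, with a base-2 logarithm of the Dice coefficient, and uses invented counts for one very common lemma and one rare lemma that occurs only in register R; none of the numbers come from Koditex.

```python
import math


def mi_score(f_xr, f_x, f_r, n):
    """MI between lemma x and register R: log2(f(xR) * N / (f(x) * f(R)))."""
    return math.log2(f_xr * n / (f_x * f_r))


def log_dice(f_xr, f_x, f_r):
    """logDice between lemma x and register R: 14 + log2(2 f(xR) / (f(x) + f(R)))."""
    return 14 + math.log2(2 * f_xr / (f_x + f_r))


# Hypothetical counts: corpus of 1,000,000 tokens, register R of 100,000 tokens.
n = 1_000_000
f_r = 100_000

# A common lemma: 32,000 occurrences overall, 4,000 of them in R.
mi_common = mi_score(4_000, 32_000, f_r, n)
ld_common = log_dice(4_000, 32_000, f_r)

# A rare lemma: 8 occurrences overall, all of them in R.
mi_rare = mi_score(8, 8, f_r, n)
ld_rare = log_dice(8, 8, f_r)

print(f"common lemma: MI={mi_common:.2f}, logDice={ld_common:.2f}")
print(f"rare lemma:   MI={mi_rare:.2f}, logDice={ld_rare:.2f}")
```

With these counts MI ranks the rare lemma first and logDice ranks the common lemma first, which is the pattern visible in the two tables above.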
Number of texts (chunks) instead of occurrences/frequencies
\[\text{MI-score} = \log_2 \frac{texts(xR) \times texts(corpus)}{texts(x)\times texts(R)}\]
\[\text{logDice} = 14 + \log_2 \frac{2\times texts(xR)}{texts(x) + texts(R)}\]
where:
lemma | Register | logDice |
---|---|---|
příslušný ‘relevant’ | analysis | 12.93 |
uvedený ‘mentioned’ | analysis | 12.85 |
stanovený ‘stated’ | analysis | 12.80 |
ee ‘filler’ | question answering | 13.28 |
tenhleten ‘that’ | question answering | 13.03 |
ňák ‘somehow’ | question answering | 12.99 |
důsledek ‘consequence’ | argumentation | 12.77 |
proces ‘process’ | argumentation | 12.68 |
význam ‘meaning’ | argumentation | 12.60 |
logDice: usable (?)
lemma | Register | MI |
---|---|---|
rezistor ‘resistor’ | analysis | 3.46 |
mV ‘millivolt’ | analysis | 3.45 |
tiskopis ‘print’ | analysis | 3.45 |
eeh ‘filler’ | question answering | 4.46 |
feminizace ‘feminisation’ | question answering | 4.46 |
něak ‘somehow’ | question answering | 4.42 |
přitažení ‘dragging’ | argumentation | 2.77 |
decentralizace ‘decentralization’ | argumentation | 2.72 |
harmonizovaný ‘harmonized’ | argumentation | 2.72 |
MI: still problematic
source: Koditex
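Counting texts instead of occurrences can be sketched as below. The input format (a list of register-labelled chunks, each a list of lemmas) and the toy data are assumptions made for the example; logDice is again taken with a base-2 logarithm.

```python
import math


def text_based_scores(chunks, lemma, register):
    """Return (MI, logDice) computed from numbers of texts.

    chunks: list of (register_label, list_of_lemmas) pairs; a lemma counts
    once per chunk no matter how often it is repeated inside it.
    Returns None if the lemma never occurs in the given register.
    """
    n_texts = len(chunks)
    texts_r = sum(1 for reg, _ in chunks if reg == register)
    texts_x = sum(1 for _, lemmas in chunks if lemma in lemmas)
    texts_xr = sum(
        1 for reg, lemmas in chunks if reg == register and lemma in lemmas
    )
    if texts_xr == 0:
        return None
    mi = math.log2(texts_xr * n_texts / (texts_x * texts_r))
    ld = 14 + math.log2(2 * texts_xr / (texts_x + texts_r))
    return mi, ld


# Toy data, invented for illustration.
chunks = [
    ("argumentation", ["důsledek", "proces", "být"]),
    ("argumentation", ["důsledek", "být", "být", "být"]),
    ("narration", ["dveře", "být"]),
    ("narration", ["oko", "být", "důsledek"]),
]

print(text_based_scores(chunks, "důsledek", "argumentation"))
print(text_based_scores(chunks, "být", "argumentation"))
```

Because být occurs in every chunk, its repeated occurrences inside a single text no longer inflate the score, and the register-typical důsledek comes out ahead on both measures.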
Applied to SYN2015 (100m corpus of written Czech)
Argumentation (static cohesive):
pojem ‘concept’, důsledek ‘consequence’, teorie ‘theory’, proces ‘process’, obecný ‘general’, jev ‘phenomenon’, daný ‘given’, určitý ‘certain’, princip ‘principle’, -li ‘if’, příklad ‘example’, hledisko ‘viewpoint’, aspekt ‘aspect’, předpoklad ‘assumption’, obecně ‘generally’
Screenplay (dynamic with addressee coding):
teda ‘then’, prominout ‘sorry’, prdel ‘shit/ass’, hele ‘hey’, jo ‘yeah’, sakra ‘damn’, kurva ‘fuck/whore’, kouknout ‘look’, viď ‘see/right’, dneska ‘today’, tvůj ‘your’, hm ‘huh’, aha ‘oh’, koukat ‘stare’, ahoj ‘hi’
Narration (dynamic retrospective):
zeptat ‘ask’, ty ‘you’, tvář ‘face’, dveře ‘door’, oko ‘eye’, rameno ‘shoulder’, slyšet ‘hear’, vlas ‘hair’, tvůj ‘your’, odpovědět ‘answer’, hlas ‘voice’, sedět ‘sit’, zvednout ‘raise’, tenhle ‘this’, usmát ‘smile’
Labels in Akademický slovník současné češtiny (Academic dictionary of contemporary Czech) – currently compiled at the CLI (words: A–G)
Style characteristics:
RQ: Where can we typically find them?
Data: Koditex (SYN2015)
Which register has the strongest association with words labeled as colloquial (source: Koditex)?
n = 129
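One way to answer this kind of question is to take, for each labeled word, the register with the highest association score and then tally the winners. The sketch below assumes the scores are already computed (for example the text-based logDice); the words and values are placeholders, not dictionary or corpus data.

```python
from collections import Counter


def strongest_register(scores_by_register):
    """Return the register with the highest association score for one word."""
    return max(scores_by_register, key=scores_by_register.get)


# Placeholder words and scores, for illustration only.
scores = {
    "word_a": {"question answering": 12.1, "narration": 10.4, "analysis": 7.9},
    "word_b": {"question answering": 11.3, "narration": 11.8, "analysis": 8.2},
    "word_c": {"question answering": 12.7, "narration": 9.9, "analysis": 6.5},
}

# How many labeled words have each register as their strongest association.
tally = Counter(strongest_register(s) for s in scores.values())
for register, count in tally.most_common():
    print(register, count)
```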
Which register has the strongest association with words labeled as colloquial with a tendency to become neutral (source: Koditex)?
(source: Koditex)
Where are these expressions used in SYN2015?
Data-driven (descriptive) approach: