PDT 1 Zdeněk Žabokrtský Czech Technical University, Department of Computer Science the following presentation can be downloaded from
PDT 2 The Prague Dependency Treebank (PDT)
a long-term project aimed at complex annotation of a part of the Czech National Corpus with a rich annotation scheme
Institute of Formal and Applied Linguistics
–established in 1990 at the Faculty of Mathematics and Physics, Charles University, Prague
–Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, …
PDT 3 The Prague Dependency Treebank
inspiration:
–the Penn Treebank (the most widely used syntactically annotated corpus of English)
motivation:
–the treebank can obviously be used for further linguistic research
–more accurate results can be obtained when using annotated corpora than when using texts in their raw form (unsupervised training)
PDT 4 Source of the text
data provided by the Institute of the Czech National Corpus (ICNC)
text sample for PDT
– tokens (words and punctuation marks) in sentences, divided into 576 files, 50 sentences per file
40 % - general newspaper articles (Lidové noviny, Mladá Fronta)
20 % - economic news and analysis (Českomoravský profit)
20 % - popular science magazine (Vesmír)
20 % - information technology texts
–divided into a training set ( sentences), a development test set (3 697) and a cross-evaluation test data set (3 787)
PDT 5 Institute of the Czech National Corpus
founded in 1994 at the Faculty of Philosophy, Charles University
head of the institute: prof. František Čermák
100 million words, freely accessible:
–query language CQP (Corpus Query Processor, developed at the University of Stuttgart)
–regular expressions
–examples of queries: disku[s|z]e.+nést
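The bracket expression in the query examples can be tried out with ordinary regular expressions. The following is a minimal sketch using Python's `re` module, which is an assumption: CQP has its own regex dialect and matches over corpus tokens, so its behaviour may differ. Only the `disku[s|z]e` part of the example is used. Inside a character class `|` is a literal character, so `[s|z]` also accepts `|`; the tighter `[sz]` variant is our suggestion, not part of the slides.

```python
import re

# Pattern taken from the query examples; matched here against single word forms.
SLIDE_PATTERN = re.compile(r"disku[s|z]e")
# Tighter variant (a suggestion for illustration, not from the slides).
TIGHT_PATTERN = re.compile(r"disku[sz]e")


def matches(pattern, word):
    """Return True if the whole word form matches the pattern."""
    return pattern.fullmatch(word) is not None


for word in ["diskuse", "diskuze", "disku|e", "diskuse."]:
    print(word, matches(SLIDE_PATTERN, word), matches(TIGHT_PATTERN, word))
```

Both spelling variants match either pattern; only the literal `|` case separates the two.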
PDT 6 CNC: query example
query: .+nosit
response: tačí se trochu vybavit, kupu listí a sena - já ho ie Každý mistr by se měl nějakým rekordem či jedin anční tísni by měly dítě. Bezvýhradná povinnost p í hladovění bude schopna plod. Mimochodem i u sou evítané těhotenství tzv. a dítěte se vzdát ve pros mž sedíme, nepostavil. tuny kamení na zádech, t byl v nebezpečí a naděje dítě žádná. Jeden večer 6 - Živit mateř. mlékem 57 - Ukončit létání 58 - odstatně větší a může se řadou úctyhodných přívlas vy, v pokoji nekouřit, domů alkohol. Dodržovat ve městě, které se mělo jen svým " dělnickým hnut...
PDT 7 Layered structure of PDT
morphological level
–full morphological tagging (word forms, lemmas, morphological tags)
analytical level
–surface syntax
–syntactic annotation using dependency syntax (captures analytical functions such as Subject, Object, ...)
tectogrammatical level
–level of linguistic meaning (tectogrammatical functions such as Actor, Patient, ...)
raw text → morphologically tagged text → analytic tree structures (ATS) → tectogrammatical tree structures (TGTS)
PDT 8 The Morphological Level
a tag and a lemma are assigned to each word form from the input text
3030 tags (Czech is an inflectionally rich language)
6 tag variables
–number
–case
–gender
–degrees of comparison
–person
–negation
example:
–VPS3A - verb (indicative, present tense, sing., 3rd person, affirmative)
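The VPS3A example can be read position by position. Below is a hedged sketch of such a reading for present-tense verb tags only. The values V, P, S, 3 and A come from the example above; the entries marked "assumed" (plural, negative, 1st and 2nd person) are illustrative guesses, and `decode_verb_tag` is a hypothetical helper, not part of any PDT tool.

```python
# Position-by-position lookup tables for tags shaped like VPS3A.
# Entries marked "assumed" are NOT taken from the slides.
POS = {"V": "verb"}
FORM = {"P": "indicative, present tense"}
NUMBER = {"S": "singular", "P": "plural (assumed)"}
PERSON = {
    "1": "1st person (assumed)",
    "2": "2nd person (assumed)",
    "3": "3rd person",
}
NEGATION = {"A": "affirmative", "N": "negative (assumed)"}

TABLES = [POS, FORM, NUMBER, PERSON, NEGATION]


def decode_verb_tag(tag):
    """Decode a 5-character present-tense verb tag into readable values."""
    if len(tag) != 5 or not tag.startswith("VP"):
        raise ValueError(f"not a present-tense verb tag: {tag!r}")
    # A character missing from its table raises KeyError.
    return [table[char] for table, char in zip(TABLES, tag)]


print(decode_verb_tag("VPS3A"))
```

Other parts of speech use different positions (compare NFS4A or AFS41A in the tagged examples), so a full decoder would need one table set per part of speech.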
PDT 9 Morphological Analysis
an automatic process:
–input: word form
–output: a set of possible lemmas, each lemma accompanied by a set of possible tags
currently covers Czech lemmas, based on stems
can recognize 20 million word forms
output ambiguity:
–there may be 5 different lemmas for a given word form
–27 different tags for a given lemma
–example: učení - NNS1A, NNS2A, NNS3A, ..., NNP5A
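The input/output contract described above (word form in, set of lemmas with a set of tags each out) can be pictured as a nested mapping. This is a sketch under stated assumptions: the tiny `ANALYSES` table is a hypothetical stand-in for the real analyzer, filled with the readings of "Ty" and "mají" shown in the tagged sentence example (PDT 12), and `analyze` and `ambiguity` are illustrative names.

```python
# Hypothetical stand-in for the analyzer: word form -> {lemma: [possible tags]}.
ANALYSES = {
    "Ty": {
        "ty": ["PP2S1", "PP2S5"],
        "ten": ["PDFP1", "PDFP4", "PDIP1", "PDIP4", "PDMP4"],
    },
    "mají": {"mít": ["VPP3A"]},
}


def analyze(form):
    """Return all lemma -> tags readings for a word form (empty if unknown)."""
    return ANALYSES.get(form, {})


def ambiguity(form):
    """Return (number of lemmas, total number of lemma/tag readings)."""
    analysis = analyze(form)
    readings = sum(len(tags) for tags in analysis.values())
    return len(analysis), readings


print(ambiguity("Ty"))    # two lemmas, seven readings in total
print(ambiguity("mají"))  # unambiguous
```

Manual disambiguation then amounts to picking exactly one lemma and one tag from this structure for each token in context.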
PDT 10 The whole process of morphological tagging
raw text → automatic morphological analysis → manual disambiguation → automatic comparison → manual correction → unambiguously tagged text
manual disambiguation:
–2 annotators
–in the full text context
–special software tool
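The automatic comparison step can be sketched as a token-by-token diff of the two annotators' choices, with every disagreement handed on to manual correction. This is a minimal illustration, not the project's actual tool: the function name and the sample choices of the two annotators are hypothetical.

```python
def compare_annotations(first, second):
    """Return (position, choice_1, choice_2) for every token the annotators disagree on.

    Each annotation is a list of (word form, lemma, tag) triples for the same text.
    """
    if len(first) != len(second):
        raise ValueError("annotations cover a different number of tokens")
    disagreements = []
    for position, (choice_1, choice_2) in enumerate(zip(first, second)):
        if choice_1 != choice_2:
            disagreements.append((position, choice_1, choice_2))
    return disagreements


# Hypothetical disambiguation choices for three tokens.
annotator_1 = [("Ty", "ten", "PDFP1"), ("mají", "mít", "VPP3A"), ("takovou", "takový", "AFS41A")]
annotator_2 = [("Ty", "ten", "PDMP4"), ("mají", "mít", "VPP3A"), ("takovou", "takový", "AFS41A")]

for position, choice_1, choice_2 in compare_annotations(annotator_1, annotator_2):
    print(position, choice_1, choice_2)
```

Only the first token is reported here, so a human corrector would need to look at one token instead of three.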
PDT 11 Data Format
Standard Generalized Markup Language (SGML)
a sample of DTD (Document Type Definition) related to the morphological level:
<!ELEMENT MMl - O (#PCDATA & R? & E? & e? & T* & MMt*) -- lemma (base form), description see the l tag; machine assigned (by a morphological analysis program), NOT disambiguated -->
<!ELEMENT MDl - O (#PCDATA & R? & E? & e? & T* & MDt*) -- lemma (base form), description see the l tag; machine assigned (by a tagger), disambiguated if more than 1: n-best -->
...
<!ELEMENT MMt - O (#PCDATA) -- morphological tag(s) as assigned by morphology, NOT disambiguated -->
<!ELEMENT MDt - O (#PCDATA) -- morphological tag(s) as assigned by machine, disambiguated, possibly also with weight/prob; if more than 1: n-best -->
PDT 12 Example of tagged sentence
Ty mají pak někdy takovou publicitu, že to dotyčnou kancelář zlikviduje.
(each line: word form, lemma(s) with possible tags, analytical function, position in the sentence, position of the governing node)
Ty ty PP2S1 PP2S5 ten PDFP1 PDFP4 PDIP1 PDIP4 PDMP4 Sb 1 2
mají mít VPP3A Pred 2 0
pak pak DB Adv 3 2
někdy někdy DB Adv 4 2
takovou takový AFS41A AFS71A Atr 5 6
publicitu publicita NFS4A Obj 6 2
, , ZIP AuxX 7 8
že že JS AuxC 8 6
to ten PDNS1 PDNS4 Sb 9 13
dotyčnou dotyčný AFS41A AFS71A Atr
kancelář kancelář NFS1A NFS4A Obj
prakticky prakticky_^(*1ý) DG1A Adv
zlikviduje zlikvidovat_:W VPS3A Obj
. . ZIP AuxK 14 0
PDT 13 The Analytical Level
the dependency structure was chosen to represent the syntactic relations within the sentence
output of the analytical level: analytical tree structure (ATS)
–oriented, acyclic graph with one entry node
–every word form and punctuation mark is represented as a node
–the nodes are annotated by attribute-value pairs
new attribute: analytical function
–determines the relation between the dependent node and its governing node
–values: Sb, Obj, Adv, Atr, ...
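The "oriented, acyclic graph with one entry node" property can be checked directly from the position/governor pairs in the tagged data. The sketch below uses the first eight tokens of the tagged sentence example (PDT 12), the only ones with complete numbers. Treating governor 0 as a technical root is our reading of the data, and `build_children` and `is_tree` are hypothetical helper names.

```python
from collections import defaultdict

# (position, word form, analytical function, governor position).
# Governor 0 is read here as the technical root (an assumption).
NODES = [
    (1, "Ty", "Sb", 2),
    (2, "mají", "Pred", 0),
    (3, "pak", "Adv", 2),
    (4, "někdy", "Adv", 2),
    (5, "takovou", "Atr", 6),
    (6, "publicitu", "Obj", 2),
    (7, ",", "AuxX", 8),
    (8, "že", "AuxC", 6),
]


def build_children(nodes):
    """Map each governor position to the list of its dependents."""
    children = defaultdict(list)
    for position, _form, _afun, governor in nodes:
        children[governor].append(position)
    return children


def is_tree(nodes):
    """True if every node is reached exactly once when walking down from root 0."""
    children = build_children(nodes)
    seen = set()
    stack = [0]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child in seen:
                return False  # the same node was reached twice
            seen.add(child)
            stack.append(child)
    # Nodes caught in a cycle are never reached from the root.
    return seen == {position for position, *_rest in nodes}


print(build_children(NODES)[2])  # dependents of "mají"
print(is_tree(NODES))
```

A pair of nodes that govern each other is never reached from the root, so `is_tree` rejects it without needing a separate cycle search.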
PDT 14 Example of ATS V návrzích na případné změny vycházejí ze svých většinou několikaletých podnikatelských zkušeností.
PDT 15 Selected attributes of ATS’s nodes
PDT 16 Selected values of the analytical function
PDT 17 Example of tagged sentence
... ve sledovaném období žádný okres nezlepšil svoji pozici ...
ve v RV4 RV6 AuxP 4 9
sledovaném sledovaný_^(*2t) AIS61A AMS61A ANS61A Atr 5 6
období období NNP1A NNP2A NNP4A NNP5A NNS1A NNS2A NNS3A NNS4A NNS5A NNS6A Adv 6 4
žádný žádný PNFIS4 PNFYS1 PNFYS5 Atr 7 8
okres okres NIS1A NIS4A Sb 8 9
nezlepšil zlepšit_:W VRYSN Pred_Co 9 11
pozici pozice NFS3A NFS4A NFS6A Obj 10 9
PDT 18 The Tectogrammatical Level
based on the framework of the Functional Generative Description as developed by Petr Sgall
in comparison to the ATSs, the tectogrammatical tree structures (TGTSs) have the following characteristics:
–only autosemantic words have their own node; function words (conjunctions, prepositions) are attached as indices to the autosemantic words to which they belong
–nodes are added in case of clearly specified deletions on the surface level
–analytical functions are substituted by tectogrammatical functions (functors), such as Actor, Patient, Addressee, ...
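The first characteristic, function words losing their own node and becoming indices on autosemantic words, can be illustrated on the fragment "ve sledovaném období žádný okres nezlepšil" from the tagged example (PDT 17). This is a deliberately simplified sketch, not the PDT transformation procedure: it handles only nodes labelled AuxP or AuxC, assumes the input is a tree, and neither restores deleted nodes nor assigns functors. The `fw` attribute and the function name are hypothetical.

```python
# Analytical functions treated as function words in this sketch
# (AuxP marks the preposition, AuxC the conjunction in the tagged examples).
FUNCTION_AFUNS = {"AuxP", "AuxC"}

# position -> node, taken from the tagged fragment; lemmas without technical suffixes.
ATS = {
    4: {"lemma": "v", "afun": "AuxP", "head": 9},
    5: {"lemma": "sledovaný", "afun": "Atr", "head": 6},
    6: {"lemma": "období", "afun": "Adv", "head": 4},
    7: {"lemma": "žádný", "afun": "Atr", "head": 8},
    8: {"lemma": "okres", "afun": "Sb", "head": 9},
    9: {"lemma": "zlepšit", "afun": "Pred_Co", "head": 11},
}


def collapse_function_words(nodes):
    """Drop function-word nodes and record their lemmas on the dependent nodes."""
    result = {}
    for position, node in nodes.items():
        if node["afun"] in FUNCTION_AFUNS:
            continue  # function words get no node of their own
        head = node["head"]
        markers = []
        # Climb over function-word governors, remembering their lemmas.
        while head in nodes and nodes[head]["afun"] in FUNCTION_AFUNS:
            markers.append(nodes[head]["lemma"])
            head = nodes[head]["head"]
        result[position] = {**node, "head": head, "fw": markers}
    return result


for position, node in collapse_function_words(ATS).items():
    print(position, node["lemma"], node["head"], node["fw"])
```

After the collapse, "období" hangs directly on the verb and carries the preposition "v" as an index, while the preposition node itself is gone.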
PDT 19 Example of TGTS
Podle předběžných odhadů se totiž počítá, že do soukromého vlastnictví bude prodáno minimálně bytů
PDT 20 Selected attributes of a TGTS's node
PDT 21 Functors
tectogrammatical counterparts of analytical functions
about 40 functors in 2 groups:
–actants: Actor, Patient, Addressee, Origin, Effect
–free modifiers: LOC, DIR1, RSTR, TWHEN, TTIL, ...
provide more detailed information about the relation to the governing node than the analytical function
PDT 22 Example of ATS... Kdo chce investovat dvě stě tisíc korun do nového automobilu, nelekne se, že benzín byl změnou zákona trochu zdražen.
PDT 23 ... and the corresponding TGTS
PDT 24 Tectogrammatical tagging
2 parallel streams from the ATS treebank:
–smaller set of fully tagged TGTSs
–larger set of partially tagged TGTSs (only changes of tree structure, functor and TFA assignment)
PDT 25 Problems of automatic functor assignment
za roh - DIR3
za hodinu - TWHEN
za svobodu - OBJ
po otci
–TWHEN (Přišel po otci.)
–NORM (Jmenuje se po otci.)
–HER (Zdědil dům po otci.)
–...
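The examples above show two separate problems: the preposition alone does not determine the functor ("za"), and even preposition plus noun may not ("po otci"). A small lookup table makes this concrete. It is a sketch containing only the cases listed on the slide; the list for "po otci" is open-ended in the source, so the table is incomplete by construction, and the function names are hypothetical.

```python
# (preposition, noun form) -> functors listed on the slide (incomplete for "po otci").
CANDIDATES = {
    ("za", "roh"): ["DIR3"],
    ("za", "hodinu"): ["TWHEN"],
    ("za", "svobodu"): ["OBJ"],
    ("po", "otci"): ["TWHEN", "NORM", "HER"],
}


def functors_for_preposition(preposition):
    """All functors seen with a preposition, regardless of the noun."""
    found = set()
    for (prep, _noun), functors in CANDIDATES.items():
        if prep == preposition:
            found.update(functors)
    return sorted(found)


def is_ambiguous(preposition, noun):
    """True if the preposition + noun pair still allows more than one functor."""
    return len(CANDIDATES.get((preposition, noun), [])) > 1


print(functors_for_preposition("za"))
print(is_ambiguous("za", "roh"), is_ambiguous("po", "otci"))
```

For "po otci" the choice depends on the governing verb (přišel, jmenuje se, zdědil), which a lookup keyed on the prepositional phrase alone cannot see.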
PDT 26 Summary
the current state of the art:
–there are several manually annotated files of TGTSs
–methods for automatic transformation from ATS into TGTS form are in development
stages: Czech National Corpus, morphologically tagged corpus, ATS treebank, TGTS treebank
timeline: September 1994, November 1996, March 2000