== Building the Tigrinya Web corpus ==

The building of the corpus is described at [[TigrinyaCorpus]].

== Corpus properties ==
Basic properties of corpus sources are summarised below.

The size of corpus structures:
||=Document count =|| 1,907||
||=Paragraph count =|| 28,552||
||=Sentence count =|| 139,357||
||=Token count =|| 2,531,443||
||=Ge'ez script lexicon size =|| 225,132||
||=Sera transliteration lexicon size =|| 220,935||
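
A few derived ratios, such as average sentence length, follow directly from the sizes above. The snippet below is a minimal sketch: the dictionary keys and the helper `ratio` are illustrative names, and the last ratio assumes that the lexicon size counts distinct word forms.

```python
# Corpus sizes copied from the table above.
SIZES = {
    "documents": 1_907,
    "paragraphs": 28_552,
    "sentences": 139_357,
    "tokens": 2_531_443,
    "geez_lexicon": 225_132,
    "sera_lexicon": 220_935,
}


def ratio(numerator: str, denominator: str) -> float:
    """Return SIZES[numerator] / SIZES[denominator]."""
    return SIZES[numerator] / SIZES[denominator]


tokens_per_sentence = ratio("tokens", "sentences")
paragraphs_per_document = ratio("paragraphs", "documents")
# Assumption: lexicon size = number of distinct word forms (types).
type_token_ratio = ratio("geez_lexicon", "tokens")

if __name__ == "__main__":
    print(f"tokens per sentence:     {tokens_per_sentence:.2f}")
    print(f"paragraphs per document: {paragraphs_per_document:.2f}")
    print(f"type/token ratio:        {type_token_ratio:.4f}")
```

With these figures a sentence averages about 18 tokens and a document about 15 paragraphs.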

== Morphological annotation ==

The corpus was tagged with !TreeTagger using a model trained on the cleaned version of the WIC Corpus [1].

Manual evaluation of the tagging accuracy is planned for the next phase of the HaBiT project.
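
Tagger output in one-token-per-line form can be tallied with a few lines of Python. This is only a sketch under stated assumptions: it expects tab-separated lines with the token in the first column and the tag in the second, the function name `count_tags` is hypothetical, and the sample tokens (`w1`, `w2`, ...) are placeholders rather than corpus data.

```python
from collections import Counter
from typing import Iterable


def count_tags(lines: Iterable[str]) -> Counter:
    """Count tags in tab-separated 'token<TAB>tag[<TAB>...]' lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Lines without a tag column (blank lines, structural markup
        # such as <s> or </s>) carry no token and are skipped.
        if len(fields) < 2:
            continue
        counts[fields[1]] += 1
    return counts


# Placeholder input: dummy tokens paired with tags used in this corpus.
sample = ["<s>", "w1\tN", "w2\tNP", "w3\tV", "w4\tSENT", "</s>"]
tag_counts = count_tags(sample)

if __name__ == "__main__":
    for tag, count in tag_counts.most_common():
        print(tag, count)
```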

=== Tag-set ===

||=Basic class=||=Definition of the tag=||=Tag=||
||Noun||Verbal/infinitival noun, formed from any verb|| VN ||
|| ||Noun attached with a preposition|| NP ||
|| ||Noun attached with a conjunction|| NC ||
|| ||Noun with a proclitic preposition and an enclitic conjunction|| NPC ||
|| ||Any other noun|| N ||
||Pronoun||Pronoun attached with a preposition|| PRONP ||
|| ||Pronoun attached with a conjunction|| PRONC ||
|| ||Pronoun with a proclitic preposition and an enclitic conjunction|| PRONPC ||
|| ||Any other pronoun|| PRON ||
||Verb||Auxiliary verb|| AUX ||
|| ||Relative verb|| VREL ||
|| ||Verb attached with a preposition|| VP ||
|| ||Verb attached with a conjunction|| VC ||
|| ||Verb with a proclitic preposition and an enclitic conjunction|| VPC ||
|| ||Any other verb|| V ||
||Adjective||Adjective attached with a preposition|| ADJP ||
|| ||Adjective attached with a conjunction|| ADJC ||
|| ||Adjective with a proclitic preposition and an enclitic conjunction|| ADJPC ||
|| ||Any other adjective|| ADJ ||
||Preposition||Preposition|| PREP ||
||Conjunction||Conjunction|| CONJ ||
||Adverb||Adverb|| ADV ||
||Numeral||Cardinal|| NUMCR ||
|| ||Ordinal|| NUMOR ||
|| ||Numeral attached with a preposition|| NUMP ||
|| ||Numeral attached with a conjunction|| NUMC ||
|| ||Numeral with a proclitic preposition and an enclitic conjunction|| NUMPC ||
||Interjection||Interjection|| INT ||
||Punctuation||Punctuation|| PUNC ||
||Unclassified||Unclassified|| UNC ||
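
The tag names above follow a regular pattern: a base tag (`N`, `PRON`, `V`, `ADJ`, `NUM`) followed by `P`, `C` or `PC` for an attached preposition, conjunction, or both. A sketch of splitting a tag into these parts follows; the function name `split_tag` and the layout of the returned tuple are illustrative choices, not part of the tagger.

```python
from typing import Tuple

# Base tags that combine with clitic suffixes in the tag-set table.
BASES = ("N", "PRON", "V", "ADJ", "NUM")
# Suffix -> (has attached preposition, has attached conjunction).
SUFFIXES = {"P": (True, False), "C": (False, True), "PC": (True, True)}

# Precompute every composite tag so that atomic tags such as PREP, PUNC,
# UNC or VN are never split by accident.
COMPOSITE = {
    base + suffix: (base, *flags)
    for base in BASES
    for suffix, flags in SUFFIXES.items()
}


def split_tag(tag: str) -> Tuple[str, bool, bool]:
    """Return (base tag, has preposition, has conjunction) for a tag."""
    return COMPOSITE.get(tag, (tag, False, False))


if __name__ == "__main__":
    for tag in ("NPC", "ADJC", "PREP", "VN"):
        print(tag, split_tag(tag))
```

For example, `split_tag("NPC")` returns `("N", True, True)`, while an atomic tag such as `PREP` is returned unchanged with both flags set to `False`.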

=== Tag frequencies ===

Nouns are by far the most frequent part of speech in the corpus. The 20 most frequent part-of-speech tags are listed below. SENT is not part of the tag-set above; it is presumably the sentence-boundary tag assigned by the tagger.
||=Part-of-speech tag =||=Token count =||
||N|| 1,676,460||
||PUNC|| 135,685||
||NP|| 135,676||
||SENT|| 116,574||
||V|| 106,615||
||NUMCR|| 91,516||
||VP|| 62,990||
||NC|| 60,589||
||ADJ|| 56,009||
||VN|| 21,778||
||VREL|| 16,530||
||CONJ|| 11,953||
||ADV|| 10,193||
||PREP|| 7,444||
||NPC|| 5,954||
||UNC|| 5,723||
||VPC|| 4,532||
||ADJC|| 1,455||
||PRON|| 1,219||
||ADJPC|| 1,016||
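
To put the counts above in proportion, each can be divided by the corpus token count of 2,531,443. The sketch below does this for the five noun tags (`N`, `NP`, `NC`, `VN`, `NPC`), all of which appear in the table; the variable names are illustrative.

```python
TOKEN_COUNT = 2_531_443  # total tokens in the corpus

# Counts of the noun tags, copied from the frequency table.
NOUN_TAG_COUNTS = {
    "N": 1_676_460,
    "NP": 135_676,
    "NC": 60_589,
    "VN": 21_778,
    "NPC": 5_954,
}

# Relative frequency of each noun tag among all tokens.
shares = {tag: count / TOKEN_COUNT for tag, count in NOUN_TAG_COUNTS.items()}
# Combined share of all five noun tags.
noun_share = sum(NOUN_TAG_COUNTS.values()) / TOKEN_COUNT

if __name__ == "__main__":
    for tag, share in shares.items():
        print(f"{tag:<4} {share:6.1%}")
    print(f"all noun tags: {noun_share:.1%}")
```

Under these figures the plain `N` tag alone covers roughly two thirds of all tokens, and the five noun tags together about three quarters.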