- Timestamp: Jan 17, 2016, 10:25:59 PM
- Author: xkocinc
- Comment: --
| v5 | v6 | |
| 168 | 168 | [[BR]]The most time-consuming part is the word sketch computation, which depends on the number of relations and the complexity of the corpus queries. The indexing phase is not very CPU intensive: it is log-linear in the number of quintuples because of the sorting, and its speed depends mainly on the I/O speed of the underlying storage. The scoring phase is very fast since it uses existing indices, and usually takes less than 0.5 % of the overall computation time. |
| 169 | 169 | |
| 170 | | [[BR]]Figure 2: Trivial parallelization |
| | 170 | [[BR]] |
| | 171 | [[Image(fig2.png)]] |
| | 172 | Figure 2: Trivial parallelization |
| 171 | 173 | |
| 172 | 174 | [[BR]]It follows that any effort to speed up the process should be devoted to the computation phase, i.e. query evaluation. In (Pomikálek, 2012) we have shown a parallelization approach which, depending on the structure of the word sketch relation, allows a close to linear speedup of this phase with regard to the number of processing cores used. |
| 173 | 175 | |
| 174 | 176 | [[BR]]In natural language processing, parallelization can often be done in the most trivial way: by splitting the data to be processed into N parts and running N independent tasks (see Figure 2). However, in the case of a corpus management system, this is often not possible because the underlying string-to-number mapping needs to be consistent across a single corpus, and hence must be shared during parallel processing. As such, it easily becomes a bottleneck that severely limits the potential speedup (see Figure 3). In (Jakubíček, 2014) we describe a general mechanism, called corpus virtualization, which deals with this issue. |
| | 177 | |
| | 178 | [[Image(fig3.png)]] |
| 175 | 179 | |
| 176 | 180 | Figure 3: Shared lexicon as a parallelization bottleneck |
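The lexicon consistency problem described in this hunk can be illustrated with a minimal sketch. It is not the system's actual implementation: the function name `build_lexicon`, the toy corpus, and the ID assignment scheme (next free integer) are all assumptions made for illustration only.

```python
def build_lexicon(tokens, lexicon=None):
    """Map each previously unseen token to the next free integer ID.

    Hypothetical helper: a real corpus manager stores this mapping on disk.
    """
    lexicon = {} if lexicon is None else lexicon
    for tok in tokens:
        # setdefault keeps an existing ID and only assigns a new one if needed
        lexicon.setdefault(tok, len(lexicon))
    return lexicon


# Toy corpus split into N = 2 parts, as in trivial parallelization.
corpus = ["young", "man", "old", "man", "young", "woman"]
parts = [corpus[:3], corpus[3:]]

# Independent tasks: each part builds its own mapping, so IDs disagree.
independent = [build_lexicon(part) for part in parts]

# Shared mapping: IDs are consistent, but every task must go through the
# same structure (in a truly parallel run, access would have to be
# serialized, which is where the bottleneck comes from).
shared = {}
for part in parts:
    build_lexicon(part, shared)

print(independent[0]["man"], independent[1]["man"])  # differing IDs
print(shared["man"])                                 # one consistent ID
```

With independent lexicons the same word ends up with different IDs in different parts, so the per-part indices cannot simply be concatenated; with a shared lexicon the IDs agree, at the cost of all tasks contending for one structure.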
| … | … | |
| 202 | 206 | In case of two parallel corpora, i.e. corpora with existing alignment on sentence or paragraph level, we have devised algorithms for automatic computation of translation candidates based on such alignment. We then display word sketches for the top translation candidate, with aligned grammatical relations, and also show aligned segments with usages for individual collocations. An example parallel word sketch is provided in Figure 5. |
| 203 | 207 | |
| 204 | | [[BR]][[BR]]Figure 5: Bilingual parallel word sketch |
| | 208 | [[BR]][[BR]] |
| | 209 | [[Image(fig5.png)]] |
| | 210 | Figure 5: Bilingual parallel word sketch |
| 205 | 211 | |
| 206 | 212 | 2. Bilingual comparable word sketch (BIC) |
| … | … | |
| 212 | 218 | In the last case, the user is responsible for providing the translation manually, and we then simply show the bilingual word sketch for the given translation. |
| 213 | 219 | |
| 214 | | [[BR]]Figure 4: Multiword sketch for ''young man'' |
| | 220 | [[BR]] |
| | 221 | [[Image(fig4.png)]] |
| | 222 | Figure 4: Multiword sketch for ''young man'' |
| 215 | 223 | |
| 216 | 224 | === Longest-commonest match === |