-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpagecontent_omnius.xsd
618 lines (557 loc) · 25 KB
/
pagecontent_omnius.xsd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
<?xml version="1.0" encoding="utf-8"?>
<xs:schema
targetNamespace="https://schema.omnius.com/pagesformat/2022.03.01"
xmlns="https://schema.omnius.com/pagesformat/2022.03.01"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">
<xs:annotation>
<xs:documentation>
This Schema defines the omni:us Pages Format (OPF), intended for storing
data associated to one or more images of digitized documents.
This format is designed to be very general so that it can be used for
several purposes, such as storing ground truth data or results of
automatic recognition processes. It is also intended to store many
possible levels of data, such as layout and transcripts at region,
line, word and glyph levels.
General structure
=================
An OPF XML is composed of one or more Page elements, each corresponding to
a page of a document. The Page elements have an imageFilename attribute
which links to a file from where to get the corresponding page image. This
could be a path (absolute or relative to the location of the XML) in the
local machine or a URL to retrieve it remotely. The file should normally
be an image or a PDF file. In case of multipage images or PDFs, the
imageFilename should end with the page number within square brackets being
the page number 0 based (ImageMagick notation). The Page elements also
have attributes for the width and height of the page, which defines the
scale for all coordinates within this page.
Within each page there can be words, text lines and different kinds of
regions (text, table, image, separator and custom). Within table regions
there can be words, text lines, text regions and separator regions. Within
text regions there can be words and text lines. Within text lines there
can only be words. And finally within words there can only be glyphs. All
of these elements can optionally have a Coords element which defines its
geometry. In the case of text lines, there is an additional geometry
element which is its baseline. The elements that relate to text, i.e.,
text regions, text lines, words and glyphs, can have TextEquiv elements
that contains its corresponding text content. A single element may have
more than one TextEquiv element to store for instance multiple
recognition hypotheses, though all of them must have a unique type
attribute.
Most elements can have one or more property elements, though their keys
are required to be unique. These are intended to store generic information,
for example whether an element is of a given class.
As children of the root PcGts node apart from Page elements there can
also be Groups. A Group encodes a relationship between multiple elements.
For example, there could be text regions for paragraphs and a group to
indicate all paragraphs that belong to a single column. Another example
could be that in a document there is some text representing a key and some
other text its value, a relationship which could be encoded with a Group.
All elements that are a result of an automatic recognition can and should
have a conf attribute indicating the confidence of this recognition.
Reading order
-------------
The reading order of words within lines, lines within regions, pages
within the document, etc. is implied by the order of the elements in the
XML. If lines in a region do not have an explicit reading order, then
they shouldn't be part of the same region, for example a paragraph and a
side note.
Text lines
----------
For the purpose of this format, a text line is defined as a series of
words that are next to each other without overlapping along the reading
direction. For latin languages this would be words from left to right
that are all at the same vertical position and that the next word always
starts more to the right than the end of the previous word. This means
that if in some text a word was missing and later written between two
lines, since there was no space for it, then this inserted word is
considered as a different line. If there is an insertion symbol, then
this symbol should be added to the corresponding line transcript where
it visibly appears.
Baselines
---------
The order of the points in the baselines should always be left-to-right
for both text which is read left-to-right and right-to-left.
Coordinates
-----------
The Coords of elements defining its geometry in general can be any polygon
without restrictions. However, similar to the baselines, the order of the
points normally is used to encode directionality. The first point should
be the top-left corner and the rest be in clockwise order. The top-left is
with respect to the actual orientation of the respective content, not the
orientation of the page. For example, a text line which is vertical in the
page, reading from bottom to top, its top-left would be the point closest
to the bottom-left of the page.
Additional non-schema requirements and recommendations
======================================================
To prevent development errors and ease debugging of problems, a few
requirements and recommendations described next should be followed.
- All processing tools should at least validate input/output XMLs
against this schema.
- All XMLs should be in utf-8 encoding.
- The default namespace of the XMLs should be the OPF namespace.
- Attributes of the XML should be sorted alphabetically. This is so that
when comparing versions of the XML, differences can be observed at a
word/token level and ease viewing only the things that did change.
- When pretty printing the XML, it should be indented using 2 spaces and
elements without content should always be self-closed.
Relation to PRImA PAGE XML
==========================
The format is highly based on the
[PRImA PAGE XML format](https://github.com/PRImA-Research-Lab/PAGE-XML)
for page content. The names of the XML elements are mostly the same. The
most notable differences are the following:
- It allows to include information about a whole document, not just a
single page, i.e., the XMLs can contain multiple Page elements.
- Only a subset of features from PRImA PAGE XML is allowed.
- Since many region types have been removed, as alternative there is a new
CustomRegion with a type attribute.
- Coordinates are allowed to be floating point numbers, not just integers.
- It defines Property elements which can be used for diverse generic
purposes.
- TextLines are allowed as children of Page and TableRegion elements, words
are allowed as children of Page, TextRegion and TableRegion elements, and
TextRegions can also be children of TableRegions.
- There is a generic Group element stored as children of the PcGts
root node which references one or more IDs that can reference any
element type.
- Page elements optionally can have an ID attribute.
- Most elements allow a setBy attribute to identify who/what created it.
- Process elements can be added to the Metadata to document details of
processes run on the document.
</xs:documentation>
</xs:annotation>
<xs:element name="PcGts">
<xs:annotation>
<xs:documentation>
Root node for the pages content.
</xs:documentation>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="Metadata" type="MetadataType"/>
<xs:element name="Property" type="PropertyType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="Page" type="PageType" minOccurs="1" maxOccurs="unbounded"/>
<xs:element name="Group" type="GroupType" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="id" type="xs:ID" use="optional"/>
</xs:complexType>
</xs:element>
<xs:complexType name="MetadataType">
<xs:annotation>
<xs:documentation>
Basic information about the XML file.
</xs:documentation>
</xs:annotation>
<xs:sequence>
<xs:element name="Creator" type="NotEmptyType"/>
<xs:element name="Created" type="xs:dateTime"/>
<xs:element name="LastChange" type="xs:dateTime"/>
<xs:element name="Process" type="ProcessType" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="ProcessType">
<xs:annotation>
<xs:documentation>Element for information of a process performed on the document.</xs:documentation>
</xs:annotation>
<xs:attribute name="id" type="xs:ID" use="optional">
<xs:annotation>
<xs:documentation>
Short identifier of the process. For simplicity in general it could
be ps1, ps2, etc.
</xs:documentation>
</xs:annotation>
</xs:attribute>
<xs:attribute name="started" type="xs:dateTime" use="required">
<xs:annotation>
<xs:documentation>
Date and time in UTC of when the process started using the XSD
dateTime format, i.e., YYYY-MM-DDThh:mm:ssZ.
</xs:documentation>
</xs:annotation>
</xs:attribute>
<xs:attribute name="time" type="xs:float" use="required">
<xs:annotation>
<xs:documentation>
Total time the process took in seconds.
</xs:documentation>
</xs:annotation>
</xs:attribute>
<xs:attribute name="tool" type="NotEmptyType" use="required">
<xs:annotation>
<xs:documentation>
Name of the tool used to process, its version, platform it was run
on, configuration values, etc.
</xs:documentation>
</xs:annotation>
</xs:attribute>
<xs:attribute name="ref" type="NotEmptyType" use="optional">
<xs:annotation>
<xs:documentation>
Reference to the tool execution instance. It should allow locating the
output logs. For human edits, it should allow to determine the person.
</xs:documentation>
</xs:annotation>
</xs:attribute>
</xs:complexType>
<xs:complexType name="PageType">
<xs:annotation>
<xs:documentation>
Element to represent a single page of a document.
</xs:documentation>
</xs:annotation>
<xs:sequence>
<xs:element name="ImageOrientation" type="ImageOrientationType" minOccurs="0" maxOccurs="1"/>
<xs:element name="Property" type="PropertyType" minOccurs="0" maxOccurs="unbounded"/>
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element name="Word" type="WordType"/>
<xs:element name="TextLine" type="TextLineType"/>
<xs:element name="TextRegion" type="TextRegionType"/>
<xs:element name="TableRegion" type="TableRegionType"/>
<xs:element name="ImageRegion" type="ImageRegionType"/>
<xs:element name="SeparatorRegion" type="SeparatorRegionType"/>
<xs:element name="CustomRegion" type="CustomRegionType"/>
</xs:choice>
</xs:sequence>
<xs:attribute name="id" type="xs:ID" use="optional"/>
<xs:attribute name="imageFilename" type="NotEmptyType" use="required">
<xs:annotation>
<xs:documentation>
A path in the local machine absolute or relative to the location
of the XML, or a URL from where to get the file. If the file corresponds
to a multipage image or PDF, then the path/URL must be followed by the
page number (0 based) within square brackets. For example `file.pdf[0]`.
</xs:documentation>
</xs:annotation>
</xs:attribute>
<xs:attribute name="imageWidth" type="xs:int" use="required"/>
<xs:attribute name="imageHeight" type="xs:int" use="required"/>
</xs:complexType>
<xs:complexType name="GroupType">
<xs:annotation>
<xs:documentation>
Element for defining general groups of elements.
</xs:documentation>
</xs:annotation>
<xs:sequence>
<xs:element name="Property" type="PropertyType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="Member" type="MemberType" minOccurs="1" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="conf" type="ConfType" use="optional"/>
<xs:attribute name="setBy" type="NotEmptyType" use="optional"/>
<xs:attribute name="id" type="xs:ID" use="required"/>
</xs:complexType>
<xs:complexType name="MemberType">
<xs:annotation>
<xs:documentation>
Element for specifying a member of a group.
</xs:documentation>
</xs:annotation>
<xs:attribute name="conf" type="ConfType" use="optional"/>
<xs:attribute name="ref" type="xs:IDREF" use="required">
<xs:annotation>
<xs:documentation>
ID of the element that is member.
</xs:documentation>
</xs:annotation>
</xs:attribute>
</xs:complexType>
<xs:complexType name="TextLineType">
<xs:annotation>
<xs:documentation>
Element to represent a single line of text in a document.
</xs:documentation>
</xs:annotation>
<xs:sequence>
<xs:element name="Property" type="PropertyType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="Coords" type="CoordsType" minOccurs="0" maxOccurs="1"/>
<xs:element name="Baseline" type="BaselineType" minOccurs="0" maxOccurs="1"/>
<xs:element name="Word" type="WordType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="TextEquiv" type="TextEquivType" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="id" type="xs:ID" use="required"/>
</xs:complexType>
<xs:complexType name="WordType">
<xs:annotation>
<xs:documentation>
Element to represent a single word or token in a document.
</xs:documentation>
</xs:annotation>
<xs:sequence>
<xs:element name="Property" type="PropertyType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="Coords" type="CoordsType" minOccurs="0" maxOccurs="1"/>
<xs:element name="Glyph" type="GlyphType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="TextEquiv" type="TextEquivType" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="id" type="xs:ID" use="required"/>
</xs:complexType>
<xs:complexType name="GlyphType">
<xs:annotation>
<xs:documentation>
Element to represent a single character or symbol in a document.
</xs:documentation>
</xs:annotation>
<xs:sequence>
<xs:element name="Property" type="PropertyType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="Coords" type="CoordsType" minOccurs="0" maxOccurs="1"/>
<xs:element name="TextEquiv" type="TextEquivType" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="id" type="xs:ID" use="required"/>
</xs:complexType>
<xs:complexType name="RegionType" abstract="true">
<xs:annotation>
<xs:documentation>
Abstract region type inherited by all other region types.
</xs:documentation>
</xs:annotation>
<xs:sequence>
<xs:element name="Property" type="PropertyType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="Coords" type="CoordsType" minOccurs="0" maxOccurs="1"/>
</xs:sequence>
<xs:attribute name="id" type="xs:ID" use="required"/>
<xs:attribute name="orientation" type="AngleType" use="optional">
<xs:annotation>
<xs:documentation>
Angle that the region has to be rotated in clockwise direction
in order to correct the skew (negative values indicate
anti-clockwise rotation). Range: (-180,180].
</xs:documentation>
</xs:annotation>
</xs:attribute>
</xs:complexType>
<xs:complexType name="TextRegionType">
<xs:annotation>
<xs:documentation>
Element to represent a region of text in a document.
</xs:documentation>
</xs:annotation>
<xs:complexContent>
<xs:extension base="RegionType">
<xs:sequence>
<xs:element name="Word" type="WordType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="TextLine" type="TextLineType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="TextEquiv" type="TextEquivType" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="readingDirection" type="ReadingDirectionType" use="optional"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="ImageRegionType">
<xs:annotation>
<xs:documentation>
Region defining any kind of graphical element, e.g. photo, diagram, etc.
</xs:documentation>
</xs:annotation>
<xs:complexContent>
<xs:extension base="RegionType"/>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="TableRegionType">
<xs:annotation>
<xs:documentation>
Tabular data in any form is represented with a table
region. Rows and columns may or may not have separator
lines.
</xs:documentation>
</xs:annotation>
<xs:complexContent>
<xs:extension base="RegionType">
<xs:sequence>
<xs:element name="Word" type="WordType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="TextLine" type="TextLineType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="TextRegion" type="TextRegionType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="SeparatorRegion" type="SeparatorRegionType" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="rows" type="xs:int" use="optional">
<xs:annotation>
<xs:documentation>
The number of rows present in the table.
</xs:documentation>
</xs:annotation>
</xs:attribute>
<xs:attribute name="columns" type="xs:int" use="optional">
<xs:annotation>
<xs:documentation>
The number of columns present in the table.
</xs:documentation>
</xs:annotation>
</xs:attribute>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="SeparatorRegionType">
<xs:annotation>
<xs:documentation>
Separators are lines or some kind of graphic that lies between columns,
rows, sections, etc. that logically separates different parts in a
document.
</xs:documentation>
</xs:annotation>
<xs:complexContent>
<xs:extension base="RegionType"/>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="CustomRegionType">
<xs:annotation>
<xs:documentation>
Regions containing content that is not covered by the default types.
</xs:documentation>
</xs:annotation>
<xs:complexContent>
<xs:extension base="RegionType">
<xs:attribute name="type" type="NotEmptyType" use="optional">
<xs:annotation>
<xs:documentation>
Information on the type of content represented by this region.
</xs:documentation>
</xs:annotation>
</xs:attribute>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="TextEquivType">
<xs:annotation>
<xs:documentation>
Element to store the corresponding text for some element.
</xs:documentation>
</xs:annotation>
<xs:sequence>
<xs:element name="Property" type="PropertyType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="Unicode" type="NotEmptyType">
<xs:annotation>
<xs:documentation>
Corresponding text in Unicode. If present, it is not allowed to be an
empty string. If there is a need for empty text, then a special symbol
should be used, e.g. "<empty/>".
</xs:documentation>
</xs:annotation>
</xs:element>
</xs:sequence>
<xs:attribute name="type" type="NotEmptyType" use="optional">
<xs:annotation>
<xs:documentation>
The type of text, used to differentiate between multiple in the same
element. For example, for n-best recognitions the type could be set
as 'best1', 'best2', etc.
</xs:documentation>
</xs:annotation>
</xs:attribute>
<xs:attribute name="conf" type="ConfType" use="optional"/>
<xs:attribute name="setBy" type="NotEmptyType" use="optional"/>
</xs:complexType>
<xs:complexType name="PropertyType">
<xs:annotation>
<xs:documentation>
A property is a key/value pair storing information about the element it is in.
Attributes are inherited to their child element and can be overidden there.
</xs:documentation>
</xs:annotation>
<xs:attribute name="key" use="required">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:pattern value="[a-zA-Z0-9_.-]+"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="value" type="NotEmptyType" use="optional"/>
<xs:attribute name="conf" type="ConfType" use="optional"/>
<xs:attribute name="setBy" type="NotEmptyType" use="optional"/>
</xs:complexType>
<xs:complexType name="ImageOrientationType">
<xs:annotation>
<xs:documentation>
The angle the image has to be rotated in clockwise direction to make the
page straight (accepted values are: -90, 0, 90, 180). In case the
orientation was the result of some automatic process, the corresponding
confidence value should be included.
</xs:documentation>
</xs:annotation>
<xs:attribute name="angle" use="required">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:pattern value="(-90|0|90|180)"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="conf" type="ConfType" use="optional"/>
<xs:attribute name="setBy" type="NotEmptyType" use="optional"/>
</xs:complexType>
<xs:complexType name="CoordsType">
<xs:annotation>
<xs:documentation>
List of points defining a polygon.
</xs:documentation>
</xs:annotation>
<xs:attribute name="points" type="PointsType" use="required"/>
<xs:attribute name="conf" type="ConfType" use="optional"/>
<xs:attribute name="setBy" type="NotEmptyType" use="optional"/>
</xs:complexType>
<xs:complexType name="BaselineType">
<xs:annotation>
<xs:documentation>
List of points defining a polyline marking the baseline of some text.
</xs:documentation>
</xs:annotation>
<xs:attribute name="points" type="PointsType" use="required"/>
<xs:attribute name="conf" type="ConfType" use="optional"/>
<xs:attribute name="setBy" type="NotEmptyType" use="optional"/>
</xs:complexType>
<xs:simpleType name="PointsType">
<xs:annotation>
<xs:documentation>
List of x,y coordinates as floats with format "x1,y1 x2,y2 ...".
</xs:documentation>
</xs:annotation>
<xs:restriction base="xs:string">
<xs:pattern value="([-.0-9]+,[-.0-9]+ )+([-.0-9]+,[-.0-9]+)"/>
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="AngleType">
<xs:annotation>
<xs:documentation>
Angle in the range (-180,180] degrees.
</xs:documentation>
</xs:annotation>
<xs:restriction base="xs:float">
<xs:minExclusive value="-180"/>
<xs:maxInclusive value="180"/>
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="ConfType">
<xs:annotation>
<xs:documentation>
Confidence value associated to an automatic recognition process. For
manual edits it could also be used for specifying a trust level for
the person that made the edit. It is limited to the range zero to one,
both inclusive.
</xs:documentation>
</xs:annotation>
<xs:restriction base="xs:float">
<xs:minInclusive value="0"/>
<xs:maxInclusive value="1"/>
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="ReadingDirectionType">
<xs:annotation>
<xs:documentation>
Reading direction of some corresponding text.
</xs:documentation>
</xs:annotation>
<xs:restriction base="xs:string">
<xs:enumeration value="left-to-right"/>
<xs:enumeration value="right-to-left"/>
<xs:enumeration value="top-to-bottom"/>
<xs:enumeration value="bottom-to-top"/>
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="NotEmptyType">
<xs:annotation>
<xs:documentation>
Simple type to restrict values to be non-empty strings.
</xs:documentation>
</xs:annotation>
<xs:restriction base="xs:token">
<xs:minLength value="1"/>
</xs:restriction>
</xs:simpleType>
</xs:schema>