Documentation
All Field types have a keyword argument serial, which can be used to define the default name of this field on the stream. If this argument is not provided, the name of the Python attribute is used.
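The fallback rule can be pictured with a small plain-Python sketch. The Field class and the serial_names helper below are hypothetical stand-ins written for illustration; they are not the dr.Field implementation and only mimic the naming behaviour described above.

```python
class Field(object):
    """Hypothetical stand-in for a field declaration; not schwa's dr.Field."""

    def __init__(self, serial=None):
        self.serial = serial  # explicit on-stream name, if one was given


def serial_names(cls):
    """Map each Field attribute declared on cls to its on-stream name."""
    names = {}
    for attr, value in vars(cls).items():
        if isinstance(value, Field):
            # Fall back to the Python attribute name when serial is omitted.
            names[attr] = value.serial if value.serial is not None else attr
    return names


class Token(object):
    raw = Field()
    pos = Field(serial='gold_pos')


print(serial_names(Token))
```

Here raw keeps its attribute name on the stream, while pos is written under the name gold_pos.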
It is important to note that there is no way of storing pointers/references to Ann instances other than using a Pointer or Pointers (or SelfPointer/SelfPointers) type. Storing them as the values in a Python dictionary, for example, is not allowed.
If you wish to define a custom constructor for your custom annotation types, the constructor needs to accept a single argument, **kwargs:
class Token(dr.Ann):
    span = dr.Slice()
    raw = dr.Field()
    norm = dr.Field()

    def __init__(self, **kwargs):
        super(Token, self).__init__(**kwargs)
        self.count = 0
A Field represents any generic value type, excluding:
- A reference to another Ann instance (see Pointer(X) and SelfPointer)
- A list of Ann instances (see Pointers(X) and SelfPointers)
- A slice: a pair of integers or pointers (see Slice)
class Token(dr.Ann):
    raw = dr.Field()
    pos = dr.Field(serial='gold_pos')
A Pointer field type is used to store a pointer to another (single) Ann instance. When declaring a Pointer instance, you are required to provide the Ann subclass that this field points to. This argument can either be the Ann subclass itself, or the name of the class as a string. The subclass itself is preferable.
class CCGSpan(dr.Ann):
    cat = dr.Field()
    l = dr.Pointer('CCGSpan', serial='left')
    r = dr.Pointer('CCGSpan', serial='right')
When the Doc object has more than one Store instance for the same Ann type, it is ambiguous which of these Store instances a Pointer refers to. In this case, the keyword argument store must be used to name the Store instance on the document.
class Token(dr.Ann):
    span = dr.Slice()
    norm = dr.Field()

class Foo(dr.Ann):
    tok = dr.Pointer(Token, store='ptb_tokens')

class Doc(dr.Doc):
    ptb_tokens = dr.Store(Token)
    bbn_tokens = dr.Store(Token)
In this example, if the store keyword argument were omitted, it would be ambiguous whether tok is a pointer into the ptb_tokens or the bbn_tokens Store instance.
Pointers works in exactly the same way as Pointer, except that it is used to store a list of pointers to Ann type X, instead of just a single pointer.
class PTBNode(dr.Ann):
    children = dr.Pointers('PTBNode')
A SelfPointer is a Pointer which points back to the same type as the containing Ann class. However, a store is not specified for a SelfPointer. The store attribute is defined to be the same store that the current object is in. This is useful for node objects in a tree, where you might have more than one store of the type.
class Node(dr.Ann):
    label = dr.Field()
    parent = dr.Pointer('Node', store='???')

class Doc(dr.Doc):
    gold_nodes = dr.Store(Node)
    expr1_nodes = dr.Store(Node)
In this case, it is not possible to specify a sensible value for store, as you want the Pointer to point into the same store as the current node. This is where SelfPointer comes into play.
class Node(dr.Ann):
    label = dr.Field()
    parent = dr.SelfPointer()

class Doc(dr.Doc):
    gold_nodes = dr.Store(Node)
    expr1_nodes = dr.Store(Node)
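To see why the containing store matters, here is a rough plain-Python model. It assumes, purely for illustration, that a pointer is kept as an index into a store; the dict-based nodes and the resolve_parent helper are hypothetical and are not part of the dr API.

```python
def resolve_parent(node, containing_store):
    """Follow node's parent index within the store that holds node."""
    index = node['parent']
    return None if index is None else containing_store[index]


# Two stores of the same node type, each with its own tree.
gold_nodes = [{'label': 'S', 'parent': None}, {'label': 'NP', 'parent': 0}]
expr1_nodes = [{'label': 'ROOT', 'parent': None}, {'label': 'VP', 'parent': 0}]

# The same index resolves to a different parent depending on the store.
print(resolve_parent(gold_nodes[1], gold_nodes)['label'])
print(resolve_parent(expr1_nodes[1], expr1_nodes)['label'])
```

Under this model, no single store name could be fixed in the class declaration, since each node needs its pointer resolved against whichever store it lives in.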
SelfPointers is the same as Pointers, except that it behaves like a SelfPointer instead of like a Pointer.
A Slice is used to represent a pair of values: a begin and end point. These two values should either be two integers (e.g. byte offsets) or two pointers to instances of another Ann type. In the pointer case, Slice expects the same arguments upon construction as Pointer. In the integer case, the type name argument can be omitted.
class Token(dr.Ann):
    span = dr.Slice()
    norm = dr.Field()

class Sentence(dr.Ann):
    span = dr.Slice(Token)
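The relationship between the two kinds of slice can be illustrated without docrep at all. The sketch below uses Python's built-in slice and plain lists; the sample text, the offsets, and the half-open convention are assumptions chosen for the example.

```python
# Plain-Python illustration; no docrep classes are used here.
text = 'The quick brown fox.'

# Integer slices: each token covers a half-open range of offsets in the text.
token_spans = [slice(0, 3), slice(4, 9), slice(10, 15), slice(16, 19), slice(19, 20)]

# Slice over tokens: a sentence covers a half-open range of token positions.
sent_span = slice(0, 5)

# Dereference the sentence to its tokens, then to a span over the raw text.
tokens_in_sent = token_spans[sent_span]
sent_text_span = slice(tokens_in_sent[0].start, tokens_in_sent[-1].stop)

print([text[s] for s in tokens_in_sent])
print(text[sent_text_span])
```

The sentence never stores text offsets itself; they are derived from the first and last token it covers.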
All Ann objects need to be stored on the Doc. Collections of a particular Ann type are defined using the Store type.
class Doc(dr.Doc):
    tokens = dr.Store(Token)
Like the Field classes, Store supports the serial keyword argument, allowing you to define the default name of the storage space on the stream.
class Doc(dr.Doc):
    ptb_tokens = dr.Store(Token, serial='ptb')
    bbn_tokens = dr.Store(Token, serial='bbn')
import argparse

from schwa import dr

class Token(dr.Ann):
    span = dr.Slice()
    raw = dr.Field(serial='roar')
    norm = dr.Field()

class Sent(dr.Ann):
    span = dr.Slice(Token)

class Paragraph(dr.Ann):
    span = dr.Slice(Sent)

class Doc(dr.Doc):
    filename = dr.Field(serial='fn')
    tokens = dr.Store(Token)
    sents = dr.Store(Sent)
    pars = dr.Store(Paragraph)

# create the schema
schema = Doc.schema()

# populate an argparse parser with the schema
parser = argparse.ArgumentParser()
parser.add_argument('--filename', required=True)
schema.add_to_argparse(parser)
args = parser.parse_args()

# read the docrep file with the runtime serial mappings
with open(args.filename, 'rb') as f:
    reader = dr.Reader(f, schema)
    for doc in reader:
        print(doc)
from schwa import dr

class Token(dr.Ann):
    span = dr.Slice()
    norm = dr.Field()

class Sent(dr.Ann):
    span = dr.Slice(Token)

class Doc(dr.Doc):
    tokens = dr.Store(Token)
    sents = dr.Store(Sent)

# an empty document and a document holding one five-token sentence
doc1 = Doc()
doc2 = Doc()
doc2.tokens.create(span=slice(0, 3), norm='The')
doc2.tokens.create(span=slice(4, 9), norm='quick')
doc2.tokens.create(span=slice(11, 16), norm='brown')
doc2.tokens.create(span=slice(17, 20), norm='fox')
doc2.tokens.create(span=slice(20, 21), norm='.')
doc2.sents.create(span=slice(0, 5))

# write both documents to the stream, opened in binary mode
with open(..., 'wb') as fout:
    writer = dr.Writer(fout, Doc)
    writer.write(doc1)
    writer.write(doc2)
Docrep primarily deals with getting the streaming format in and out of a memory model. To make in-memory documents more usable, they need to be decorated with derivative features. A decorator is a function that accepts a document and augments it, but by convention should not modify any fields that are read from or written to the stream.
dr.requires_decoration(decorator) (or dr.method_requires_decoration(decorator)) may wrap a function whose first (or second) argument is a document, to ensure that the decorator is executed first.
@dr.requires_decoration(decorator)
def process(doc):
    do_stuff
equates to:
def process(doc):
    decorator(doc)
    do_stuff
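The equivalence above can be sketched in plain Python. The wrapper below is an illustrative stand-in and not the source of dr.requires_decoration; the dict-based document and the add_token_count decorator are hypothetical.

```python
import functools


def requires_decoration(decorator):
    """Illustrative stand-in: run decorator on the document, then the function."""
    def wrap(fn):
        @functools.wraps(fn)
        def wrapper(doc, *args, **kwargs):
            decorator(doc)  # decoration always happens before the body runs
            return fn(doc, *args, **kwargs)
        return wrapper
    return wrap


def add_token_count(doc):
    # Hypothetical decorator: derives a value without changing the token data.
    doc['n_tokens'] = len(doc['tokens'])


@requires_decoration(add_token_count)
def process(doc):
    return doc['n_tokens']


print(process({'tokens': ['The', 'quick', 'brown', 'fox', '.']}))
```

Because the wrapper calls the decorator itself, callers of process never need to remember to decorate the document beforehand.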
Decorators can be defined with dr.Decorator (dr.decorator), which ensures that the same decorator (or one with the same key) is not executed on a document multiple times. This allows decoration to be applied when and where it is needed without worrying about unnecessary work.
For example:
class my_decorator(dr.Decorator):
    def __init__(self, arg1, arg2):
        # compile a key from class name and arguments to ensure single execution
        super(my_decorator, self).__init__(self._build_key(arg1, arg2))

    def decorate(self, doc):
        ...

@dr.requires_decoration(my_decorator('arg1-value', 'arg2-value'))
def process(doc):
    ...
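One way to picture the single-execution guarantee is a decorator that records its key on the document and skips the work if the key is already present. The KeyedDecorator class, the '_applied' bookkeeping entry, and the dict-based document below are hypothetical; this is a sketch of the idea, not the dr.Decorator implementation.

```python
class KeyedDecorator(object):
    """Illustrative stand-in for a keyed decorator; not the dr.Decorator source."""

    def __init__(self, key):
        self.key = key

    def __call__(self, doc):
        applied = doc.setdefault('_applied', set())
        if self.key in applied:
            return  # a decorator with this key already ran on this document
        self.decorate(doc)
        applied.add(self.key)

    def decorate(self, doc):
        raise NotImplementedError


class count_tokens(KeyedDecorator):
    def __init__(self, attr):
        # Key built from the class name and arguments, as in the example above.
        super(count_tokens, self).__init__(('count_tokens', attr))
        self.attr = attr
        self.calls = 0

    def decorate(self, doc):
        self.calls += 1
        doc[self.attr] = len(doc['tokens'])


doc = {'tokens': ['The', 'quick', 'brown', 'fox', '.']}
first = count_tokens('n_tokens')
second = count_tokens('n_tokens')  # separate instance, same key
first(doc)
first(doc)
second(doc)
print(first.calls, second.calls, doc['n_tokens'])
```

Only the first call does any work: the repeated call and the second instance with an identical key are both skipped.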
A number of standard decorators have been implemented in dr.decorators, which are more fully described by their docstrings and test-cases (especially test_decorators.ApplicationsTest):
- add_prev_next(store, prev_attr, next_attr, index_attr) adds pointers and (optionally) offsets to a given store
- build_index(store, key_attr, index_attr, ...) indexes the objects in store over a given field (and can be used with an arbitrary index data structure) and stores it on the document
- build_multi_index(store, key_attr, index_attr, ...) allows a many-to-many index
- materialise_slices(source_store, target_store, slice_attr, deref_attr) stores the list of objects that a slice refers to
- reverse_slices(source_store, target_store, slice_attr, pointer_attr, offset_attr, ...) augments an annotation A with its relationship to another annotation B whose slice covers A
- convert_slices(source_store, target_store, source_slice_attr, target_slice_attr, new_slice_attr) dereferences nested slices, calculating e.g. a span over raw text given a span over tokens
- reverse_pointers(source_store, target_store, pointer_attr, rev_attr, ...) augments an annotation A with a pointer to B where B points to A
When working with these, in general:
- a store argument may either be the store's attribute on the document, such as 'tokens', or a function which, given the document, returns the objects to process
- an attribute to retrieve a field from an object may either be a string like 'span.start' or a function like lambda token: token.span.start
- an attribute to set a field on an object may either be a string like 'raw' or None
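These conventions can be mimicked with the standard library. The helpers make_getter and resolve_store below are hypothetical names introduced for this sketch, and the Token and Doc classes are plain stand-ins; none of them are taken from dr.decorators.

```python
import operator


def make_getter(attr):
    """Return a callable that fetches attr from an object."""
    if callable(attr):
        return attr
    # attrgetter follows dotted paths such as 'span.start'
    return operator.attrgetter(attr)


def resolve_store(doc, store):
    """Return the objects to process, given a store name or a function."""
    if callable(store):
        return store(doc)
    return getattr(doc, store)


class Token(object):
    def __init__(self, span):
        self.span = span


class Doc(object):
    def __init__(self, tokens):
        self.tokens = tokens


doc = Doc([Token(slice(0, 3)), Token(slice(4, 9))])
by_name = make_getter('span.start')
by_func = make_getter(lambda token: token.span.start)

print([by_name(t) for t in resolve_store(doc, 'tokens')])
print([by_func(t) for t in resolve_store(doc, lambda d: d.tokens[1:])])
```

Accepting either form keeps the common case short (a string) while still allowing arbitrary selection logic through a function.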