Tim Dawborn edited this page Mar 30, 2014 · 2 revisions

Fields

All Field types have a keyword argument serial, which can be used to define the default name of this field on the stream. If this argument is not provided, the name of the Python attribute is used.

Note that the only way to store pointers or references to Ann instances is through a Pointer or Pointers (or SelfPointer/SelfPointers) field. Storing them in some other way, for example as the values of a Python dictionary, is not allowed.

Field types for Ann subclasses

If you wish to define a custom constructor for your custom annotation types, the constructor needs to accept a single argument, **kwargs:

class Token(dr.Ann):
  span = dr.Slice()
  raw = dr.Field()
  norm = dr.Field()

  def __init__(self, **kwargs):
    super(Token, self).__init__(**kwargs)
    self.count = 0

Field()

A Field represents any generic value type, excluding:

  • A reference to another Ann instance (see Pointer(X) and SelfPointer)
  • A list of Ann instances (see Pointers(X) and SelfPointers)
  • A slice: a pair of integers or pointers (see Slice)

class Token(dr.Ann):
  raw = dr.Field()
  pos = dr.Field(serial='gold_pos')
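
The fallback rule for serial can be illustrated without the library. The following is a minimal sketch only: the Field class and serial_names helper here are hypothetical stand-ins written for this illustration, not the dr implementation.

```python
class Field(object):
  """Hypothetical stand-in for dr.Field; it only records the serial name."""
  def __init__(self, serial=None):
    self.serial = serial


def serial_names(cls):
  """Map each Field attribute on cls to the name it would use on the stream."""
  names = {}
  for attr, value in vars(cls).items():
    if isinstance(value, Field):
      # Fall back to the Python attribute name when serial is not given.
      names[attr] = value.serial if value.serial is not None else attr
  return names


class Token(object):
  raw = Field()
  pos = Field(serial='gold_pos')


print(sorted(serial_names(Token).items()))
# [('pos', 'gold_pos'), ('raw', 'raw')]
```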

Pointer(X)

A Pointer field type is used to store a pointer to another (single) Ann instance. When declaring a Pointer instance, you are required to provide the Ann subclass that this field points to. This argument can either be the Ann subclass itself, or the name of the class as a string. The subclass itself is preferable.

class CCGSpan(dr.Ann):
  cat = dr.Field()
  l = dr.Pointer('CCGSpan', serial='left')
  r = dr.Pointer('CCGSpan', serial='right')

When the Doc object has more than one Store instance for the same Ann type, it is ambiguous which of those Store instances a Pointer refers to. In this case, the keyword argument store must be used to name the Store instance on the document.

class Token(dr.Ann):
  span = dr.Slice()
  norm = dr.Field()

class Foo(dr.Ann):
  tok = dr.Pointer(Token, store='ptb_tokens')

class Doc(dr.Doc):
  ptb_tokens = dr.Store(Token)
  bbn_tokens = dr.Store(Token)

In this example, if the store keyword argument were omitted, it would be ambiguous whether tok points into the ptb_tokens or the bbn_tokens Store instance.
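
The ambiguity can be made concrete with a small sketch. The resolve_store function and the doc_stores mapping are hypothetical and exist only for this illustration; how the library itself detects and reports the ambiguity is not covered by this page.

```python
def resolve_store(stores, target_type, store=None):
  """Pick the store a pointer refers to.

  stores maps store name -> annotation type name, standing in for the
  Store declarations on a Doc.
  """
  if store is not None:
    if stores.get(store) != target_type:
      raise ValueError('store %r does not hold %s' % (store, target_type))
    return store
  # No explicit store: only unambiguous when exactly one store matches.
  candidates = sorted(name for name, t in stores.items() if t == target_type)
  if len(candidates) != 1:
    raise ValueError('cannot pick a store for %s from %r' % (target_type, candidates))
  return candidates[0]


doc_stores = {'ptb_tokens': 'Token', 'bbn_tokens': 'Token', 'sents': 'Sent'}
print(resolve_store(doc_stores, 'Sent'))                       # sents
print(resolve_store(doc_stores, 'Token', store='ptb_tokens'))  # ptb_tokens
try:
  resolve_store(doc_stores, 'Token')
except ValueError as e:
  print('error:', e)
```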

Pointers(X)

Pointers works in exactly the same way as Pointer, except that it is used to store a list of pointers to Ann type X, instead of just a single pointer.

class PTBNode(dr.Ann):
  children = dr.Pointers('PTBNode')

SelfPointer()

A SelfPointer is a Pointer which points back to the same type as the containing Ann class. No store is specified for a SelfPointer: it always points into the store that contains the current object. This is useful for node objects in a tree, where you might have more than one store of the type.

class Node(dr.Ann):
  label = dr.Field()
  parent = dr.Pointer('Node', store='???')

class Doc(dr.Doc):
  gold_nodes = dr.Store(Node)
  expr1_nodes = dr.Store(Node)

In this case, it is not possible to specify a sensible value for store as you want the Pointer to point into the same store as the current node. This is where SelfPointer comes into play.

class Node(dr.Ann):
  label = dr.Field()
  parent = dr.SelfPointer()

class Doc(dr.Doc):
  gold_nodes = dr.Store(Node)
  expr1_nodes = dr.Store(Node)
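
The idea of pointing into "the same store as the current node" can be sketched with plain lists. This sketch assumes that a pointer is represented as an index into a store, which is an assumption made for illustration; the parent_of helper and the dictionaries are hypothetical.

```python
def parent_of(store, node):
  """Follow a self-pointer: the index is looked up in the node's own store."""
  idx = node['parent']
  return None if idx is None else store[idx]


# Two independent stores of the same node type. The same index value 0
# means a different parent depending on which store the node lives in.
gold_nodes = [{'label': 'S', 'parent': None}, {'label': 'NP', 'parent': 0}]
expr1_nodes = [{'label': 'ROOT', 'parent': None}, {'label': 'VP', 'parent': 0}]

print(parent_of(gold_nodes, gold_nodes[1])['label'])    # S
print(parent_of(expr1_nodes, expr1_nodes[1])['label'])  # ROOT
```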

SelfPointers()

SelfPointers is the same as Pointers, except that it behaves like a SelfPointer rather than a Pointer: the list of pointers refers into the store that contains the current object.

Slice()

A Slice is used to represent a pair of values: a begin and end point. These two values should either be two integers (e.g. byte offsets) or two pointers to instances of another Ann type. In the pointer case, Slice expects the same arguments upon construction as Pointer. In the integer case, the type name argument can be omitted.

class Token(dr.Ann):
  span = dr.Slice()
  norm = dr.Field()

class Sentence(dr.Ann):
  span = dr.Slice(Token)
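
The two kinds of slice relate to each other as follows. This is a sketch using plain Python data, assuming (as with Python's own slice) that the end point is exclusive; the dictionaries are hypothetical stand-ins for Token and Sentence instances.

```python
# Integer slices: each token covers a range of offsets in the raw text.
tokens = [
  {'span': slice(0, 3), 'norm': 'The'},
  {'span': slice(4, 9), 'norm': 'quick'},
  {'span': slice(10, 13), 'norm': 'fox'},
]

# Pointer slice: the sentence covers a range of tokens, not of offsets.
sentence = {'span': slice(0, 3)}

covered = tokens[sentence['span']]
# Derive the sentence's offset span from its first and last token.
offset_span = slice(covered[0]['span'].start, covered[-1]['span'].stop)

print([t['norm'] for t in covered])        # ['The', 'quick', 'fox']
print(offset_span.start, offset_span.stop)  # 0 13
```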

Field types for Doc subclasses

Store(X)

All Ann objects need to be stored on the Doc. Collections of a particular Ann type are defined using the Store type.

class Doc(dr.Doc):
  tokens = dr.Store(Token)

Like the Field classes, Store supports the serial keyword argument, allowing you to define the default name of the storage space on the stream.

class Doc(dr.Doc):
  ptb_tokens = dr.Store(Token, serial='ptb')
  bbn_tokens = dr.Store(Token, serial='bbn')

Reading

import argparse

from schwa import dr


class Token(dr.Ann):
  span = dr.Slice()
  raw = dr.Field(serial='roar')
  norm = dr.Field()


class Sent(dr.Ann):
  span = dr.Slice(Token)


class Paragraph(dr.Ann):
  span = dr.Slice(Sent)


class Doc(dr.Doc):
  filename = dr.Field(serial='fn')
  tokens = dr.Store(Token)
  sents = dr.Store(Sent)
  pars = dr.Store(Paragraph)


# create the schema
schema = Doc.schema()

# populate an argparse parser with the schema
parser = argparse.ArgumentParser()
parser.add_argument('--filename', required=True)
schema.add_to_argparse(parser)
args = parser.parse_args()

# read the docrep file with the runtime serial mappings
with open(args.filename, 'rb') as f:
  reader = dr.Reader(f, schema)
  for doc in reader:
    print(doc)

Writing

from schwa import dr

class Token(dr.Ann):
  span = dr.Slice()
  norm = dr.Field()


class Sent(dr.Ann):
  span = dr.Slice(Token)


class Doc(dr.Doc):
  tokens = dr.Store(Token)
  sents = dr.Store(Sent)


doc1 = Doc()

doc2 = Doc()
doc2.tokens.create(span=slice(0, 3), norm='The')
doc2.tokens.create(span=slice(4, 9), norm='quick')
doc2.tokens.create(span=slice(11, 16), norm='brown')
doc2.tokens.create(span=slice(17, 20), norm='fox')
doc2.tokens.create(span=slice(20, 21), norm='.')
doc2.sents.create(span=slice(0, 5))

with open(..., 'wb') as fout:  # binary mode, matching the 'rb' used when reading
  writer = dr.Writer(fout, Doc)
  writer.write(doc1)
  writer.write(doc2)

Decoration

Docrep primarily deals with getting the streaming format in and out of a memory model. To make in-memory documents more usable, they need to be decorated with derivative features. A decorator is a function that accepts a document and augments it, but by convention should not modify any fields that are read from or written to the stream.

Applying decorators

dr.requires_decoration(decorator) (or dr.method_requires_decoration(decorator)) may wrap a function whose first (or second) argument is a document, to ensure that the decorator is executed first.

@dr.requires_decoration(decorator)
def process(doc):
  do_stuff(doc)

equates to:

def process(doc):
  decorator(doc)
  do_stuff(doc)
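
The equivalence above can be reproduced with an ordinary Python function decorator. This is a minimal sketch, not the library's code: requires_decoration_sketch and add_length are hypothetical names, and a plain dictionary stands in for the document.

```python
import functools


def requires_decoration_sketch(decorator):
  """Wrap a function so that decorator(doc) runs before the function body."""
  def wrap(fn):
    @functools.wraps(fn)
    def wrapper(doc, *args, **kwargs):
      decorator(doc)  # augment the document first
      return fn(doc, *args, **kwargs)
    return wrapper
  return wrap


def add_length(doc):
  # A derived feature; nothing read from or written to the stream is touched.
  doc['n_tokens'] = len(doc['tokens'])


@requires_decoration_sketch(add_length)
def process(doc):
  return doc['n_tokens']


print(process({'tokens': ['The', 'fox']}))  # 2
```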

Defining decorators

Decorators can be defined with dr.Decorator (dr.decorator) which ensures that the same decorator (or one with the same key) is not executed on a document multiple times. This allows decoration to be applied when and where it is needed without worrying about unnecessary work.

For example:

class my_decorator(dr.Decorator):
  def __init__(self, arg1, arg2):
    # compile a key from class name and arguments to ensure single execution
    super(my_decorator, self).__init__(self._build_key(arg1, arg2))

  def decorate(self, doc):
    ...

@dr.requires_decoration(my_decorator('arg1-value', 'arg2-value'))
def process(doc):
  ...
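
The run-once behaviour can be sketched as follows. This is an illustration under stated assumptions: RunOnceDecorator, the _decorated_by attribute and count_tokens are all hypothetical, and the way dr.Decorator actually records which keys have been applied may differ.

```python
class RunOnceDecorator(object):
  """Hypothetical stand-in for dr.Decorator: runs at most once per key per doc."""
  def __init__(self, key):
    self.key = key

  def __call__(self, doc):
    done = doc.__dict__.setdefault('_decorated_by', set())
    if self.key in done:
      return  # already applied under this key, so skip the work
    self.decorate(doc)
    done.add(self.key)

  def decorate(self, doc):
    raise NotImplementedError


class count_tokens(RunOnceDecorator):
  calls = 0  # counts how often decorate() really runs

  def __init__(self, attr):
    # Key built from the class name and arguments.
    super(count_tokens, self).__init__(('count_tokens', attr))
    self.attr = attr

  def decorate(self, doc):
    count_tokens.calls += 1
    setattr(doc, self.attr, len(doc.tokens))


class Doc(object):
  def __init__(self, tokens):
    self.tokens = tokens


doc = Doc(['The', 'fox'])
count_tokens('n')(doc)
count_tokens('n')(doc)  # same key: skipped
count_tokens('m')(doc)  # different key: runs
print(doc.n, doc.m, count_tokens.calls)  # 2 2 2
```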

Standard decorators

A number of standard decorators have been implemented in dr.decorators, which are more fully described by their docstrings and test-cases (especially test_decorators.ApplicationsTest):

  • add_prev_next(store, prev_attr, next_attr, index_attr) adds pointers and (optionally) offsets to the objects in a given store
  • build_index(store, key_attr, index_attr, ...) indexes the objects in store over a given field (and can be used with an arbitrary index data structure) and stores it on the document
  • build_multi_index(store, key_attr, index_attr, ...) allows a many-to-many index
  • materialise_slices(source_store, target_store, slice_attr, deref_attr) stores the list of objects that a slice refers to
  • reverse_slices(source_store, target_store, slice_attr, pointer_attr, offset_attr, ...) augments an annotation A with its relationship to another annotation B whose slice covers A
  • convert_slices(source_store, target_store, source_slice_attr, target_slice_attr, new_slice_attr) dereferences nested slices, calculating e.g. a span over raw text given a span over tokens
  • reverse_pointers(source_store, target_store, pointer_attr, rev_attr, ...) augments an annotation A with a pointer to B where B points to A
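
To give a feel for what two of the decorators listed above compute, here is a sketch on plain Python objects. The functions sketch_add_prev_next and sketch_build_index are hypothetical re-implementations written from the one-line descriptions in the list; the real decorators take more options and their exact behaviour is described in their docstrings.

```python
class Obj(object):
  """Bare attribute holder standing in for a Doc or Ann instance."""
  def __init__(self, **kwargs):
    self.__dict__.update(kwargs)


def sketch_add_prev_next(doc, store, prev_attr, next_attr, index_attr=None):
  """Link each object to its neighbours and optionally record its offset."""
  objs = getattr(doc, store)
  for i, obj in enumerate(objs):
    setattr(obj, prev_attr, objs[i - 1] if i > 0 else None)
    setattr(obj, next_attr, objs[i + 1] if i + 1 < len(objs) else None)
    if index_attr is not None:
      setattr(obj, index_attr, i)


def sketch_build_index(doc, store, key_attr, index_attr):
  """Index the objects in a store by one field and keep the index on the doc."""
  index = {}
  for obj in getattr(doc, store):
    index[getattr(obj, key_attr)] = obj
  setattr(doc, index_attr, index)


doc = Obj(tokens=[Obj(norm='The'), Obj(norm='quick'), Obj(norm='fox')])
sketch_add_prev_next(doc, 'tokens', 'prev', 'next', 'index')
sketch_build_index(doc, 'tokens', 'norm', 'by_norm')

middle = doc.tokens[1]
print(middle.prev.norm, middle.next.norm, middle.index)  # The fox 1
print(doc.by_norm['quick'] is middle)                    # True
```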

When working with these, in general:

  • a store argument may either be the store's attribute on the document, such as 'tokens', or a function which given the document returns the objects to process
  • an attribute to retrieve a field from an object may either be a string like 'span.start' or a function like lambda token: token.span.start
  • an attribute to set a field on an object may either be a string like 'raw' or None
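
The three conventions above can be sketched with a few small helpers. The helper names are hypothetical, and treating None as "do not set anything" is this sketch's reading of the last bullet rather than documented behaviour. operator.attrgetter is the standard library function that resolves dotted names such as 'span.start'.

```python
import operator


def get_store(doc, store):
  """A store is either an attribute name on the doc or a function of the doc."""
  return store(doc) if callable(store) else getattr(doc, store)


def make_getter(attr):
  """A getter is either a (possibly dotted) attribute name or a function."""
  return attr if callable(attr) else operator.attrgetter(attr)


def set_field(obj, attr, value):
  """A setter target is either an attribute name or None (assumed: skip)."""
  if attr is not None:
    setattr(obj, attr, value)


class Obj(object):
  def __init__(self, **kwargs):
    self.__dict__.update(kwargs)


doc = Obj(tokens=[Obj(span=slice(0, 3)), Obj(span=slice(4, 9))])

by_name = [make_getter('span.start')(t) for t in get_store(doc, 'tokens')]
by_func = [make_getter(lambda t: t.span.start)(t)
           for t in get_store(doc, lambda d: d.tokens[1:])]
print(by_name, by_func)  # [0, 4] [4]

set_field(doc.tokens[0], 'raw', 'The')
set_field(doc.tokens[1], None, 'ignored')
print(doc.tokens[0].raw, hasattr(doc.tokens[1], 'raw'))  # The False
```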