xtsv-word

This module provides a special dataclass-like structure to handle tokens in xtsv format. It is meant to be used within xtsv modules to make the processing of token attributes (xtsv fields) more comfortable and transparent.

Each token is represented by a Word object, which is created by passing the token in its xtsv representation (a list of strings, one for each field in the input stream) to a WordFactory object. The factory keeps track of the input and target fields of the xtsv module and assigns each item of the token list to the Word attribute named after the corresponding field (i.e. the xtsv column header).

Both the input and the target fields can be accessed as attributes of a Word object, i.e. they can be both retrieved and modified. (The usual use case is to only read input field attributes and to specify target field attributes. However, the Word object does not prevent a user from modifying input field values. This is discouraged but not ruled out by xtsv.) When the xtsv module is done processing a token, the Word object is simply converted into a list which contains the original input fields followed by the target fields, as expected by xtsv.
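
The mechanism described above can be illustrated with a minimal, self-contained sketch. SketchWord and SketchWordFactory are hypothetical stand-ins written for this illustration only; they are not the classes shipped in xtsv-word, and details such as initialising target fields to empty strings and taking the input fields as a plain list are assumptions.

```python
class SketchWord:
    """Hypothetical stand-in for a Word object (not the xtsv-word class)."""

    def __init__(self, field_names, values):
        # Bypass our own __setattr__ while creating the internal mapping.
        # dict preserves insertion order: input fields first, then targets.
        object.__setattr__(self, "_fields", dict(zip(field_names, values)))

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        fields = object.__getattribute__(self, "_fields")
        try:
            return fields[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        if name not in self._fields:
            raise AttributeError(f"unknown field: {name}")
        self._fields[name] = value

    def __getitem__(self, name):
        return self._fields[name]

    def __setitem__(self, name, value):
        setattr(self, name, value)

    def __iter__(self):
        # list(word) yields the input fields followed by the target fields.
        return iter(self._fields.values())


class SketchWordFactory:
    """Hypothetical stand-in for WordFactory (not the xtsv-word class)."""

    def __init__(self, input_fields, target_fields):
        self.n_input = len(input_fields)
        self.field_names = list(input_fields) + list(target_fields)

    def get_word(self, token):
        if len(token) != self.n_input:
            raise ValueError("token does not match the input fields")
        # Assumption: target fields start out as empty strings.
        n_target = len(self.field_names) - self.n_input
        return SketchWord(self.field_names, list(token) + [""] * n_target)


wf = SketchWordFactory(["form", "wsafter"], ["syllables"])
word = wf.get_word(["alma", " "])
word.syllables = "al-ma"
row = list(word)  # ['alma', ' ', 'al-ma']
```

Keeping the values in an insertion-ordered dict means the conversion to a list needs no extra bookkeeping: the column order of the output row is fixed when the factory is created.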

Disclaimer: This is not an official extension of the xtsv module.

Suggested usage

Install xtsv-word with pip:

python3 -m pip install xtsv-word

or build locally:

make

For example, suppose that the internal app class defined in the xtsv module myXtsvModule is called InternalApp, that the input stream contains the fields ['form', 'wsafter'], and that myXtsvModule has a single target field, ['syllables']. The module can then use xtsv-word as follows.

Create a WordFactory object:

# myXtsvModule.py

from xtsv_word import WordFactory

class InternalApp:
	...
	def prepare_fields(self, field_names):
		self.wf = WordFactory(field_names, self.target_fields)
		# self.target_fields is normally set in InternalApp.__init__()

Use the Word objects:

class InternalApp:
	...
	def process_sentence(self, sen, field_values):
		return_sen = []
		for tok in sen:
			# Get Word object from factory
			word = self.wf.get_word(tok)

			# Process the token by getting and setting its attributes, e.g.
			# (split_syllables stands for the module's own processing logic):
			word.syllables = split_syllables(word.form)
			...
			# alternatively access attributes as dict keys:
			word['syllables'] = '-'.join(word['syllables'])

			# convert Word object to list of fields for xtsv output stream
			return_sen.append(list(word))

		...
		return return_sen
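
The example above calls split_syllables, which is not part of xtsv-word. To try the snippet, a placeholder along the following lines could be used. This is only a naive sketch under stated assumptions: the vowel set is limited to "aeiou", and a single consonant before a vowel is attached to the following syllable. It is not a linguistically adequate syllabifier.

```python
def split_syllables(form, vowels="aeiou"):
    """Naive placeholder: split a word form into a list of syllables.

    Assumption: each vowel is a syllable nucleus, and at most one
    consonant directly before a vowel goes with the following syllable.
    """
    lower = form.lower()
    nuclei = [i for i, ch in enumerate(lower) if ch in vowels]
    if len(nuclei) < 2:
        # Zero or one vowel: the form is treated as a single syllable.
        return [form]

    cuts = []
    for prev, cur in zip(nuclei, nuclei[1:]):
        # If consonants separate the two vowels, cut before the last one;
        # if the vowels are adjacent, cut between them.
        cuts.append(cur - 1 if cur - prev > 1 else cur)

    bounds = [0] + cuts + [len(form)]
    return [form[a:b] for a, b in zip(bounds, bounds[1:])]


syllables = "-".join(split_syllables("alma"))  # 'al-ma'
```

The function returns a list of strings, which matches how the example joins the syllables with '-' before writing them to the target field.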

See docstrings for further details.

