Unilex, Corpus Crawler, Romansh corpora and spell checking #7

Open
loleg opened this issue Nov 24, 2018 · 0 comments
Labels
data source A (potential) source of (open) data

Comments

@loleg
Member

loleg commented Nov 24, 2018

Thanks to @brawer for suggesting the following tools and scripts for the hackathon.

Unilex (https://github.com/unicode-org/unilex): word frequencies (per billion tokens) for about 1,000 languages, including Rumantsch Grischun, Puter, Surmiran, Sursilvan, Sutsilvan, and Vallader, as well as the Swiss German dialects of Aargau, Bern, and the Sense district (Senslerdeutsch). License: Unicode Data License (the same as all other data from Unicode.org).

Corpus Crawler (https://github.com/googlei18n/corpuscrawler): Python script for downloading language corpora as UTF-8-encoded plain text. It supports about 1,000 languages and is the main source for the Unilex word frequencies. License: Apache-2.0

Romansh corpora (https://github.com/ProSvizraRumantscha/corpora): articles from the Romansh-language newspaper “La Quotidiana”, published between 1997 and 2008, in various dialects. License: CC0-1.0 (Creative Commons Zero, public domain)

Spell checking for Rumantsch Sursilvan (https://github.com/korero/korero-spell/tree/master/rules): an early-stage spell-checking project for Rumantsch Sursilvan, with Hunspell rules. License: GPL-3.0. Demo: https://korero.org/input
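As a starting point for working with word-frequency data such as Unilex during the hackathon, here is a minimal Python sketch. It assumes a tab-separated layout of `form<TAB>frequency` with `#` comment lines and an optional header row. That layout is an assumption and should be checked against the actual files in the repository. The sample entries and the helper names (`parse_frequencies`, `top_words`, `per_billion_to_probability`) are hypothetical and only illustrate the idea.

```python
import io


def parse_frequencies(lines):
    """Parse 'form<TAB>frequency' lines into a dict.

    Assumed layout (not verified against the real files): '#' starts a
    comment line, and rows whose second column is not an integer
    (such as a header row) are skipped.
    """
    freqs = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            continue
        form, value = parts
        try:
            freqs[form] = int(value)
        except ValueError:
            continue  # non-numeric second column, e.g. a header row
    return freqs


def top_words(freqs, n):
    """Return the n most frequent forms, ties broken alphabetically."""
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def per_billion_to_probability(freq):
    """Convert a per-billion-tokens count into a relative frequency."""
    return freq / 1_000_000_000


# Hypothetical sample data, made up for illustration only.
SAMPLE = (
    "# hypothetical sample, not real Unilex data\n"
    "Form\tFrequency\n"
    "de\t40000000\n"
    "la\t30000000\n"
    "il\t35000000\n"
)

freqs = parse_frequencies(io.StringIO(SAMPLE))
print(top_words(freqs, 2))
```

To read a real file, pass an open file handle (for example `open(path, encoding="utf-8")`) instead of the `io.StringIO` sample.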

@loleg loleg added the data source A (potential) source of (open) data label Nov 24, 2018