Update Docusaurus to 3.4 and fix broken anchor links #348

Open
wants to merge 2 commits into
base: master
38 changes: 23 additions & 15 deletions docs/manual/cellediting.md → docs/manual/cellediting.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,13 @@ id: cellediting
title: Cell editing
sidebar_label: Cell editing
---

{/* code comment:
we made a custom 3rd-level, blue-colored heading to make it easier
to tell the upper 1st- and 2nd-level headings apart from the 4th- and 5th-level ones */}
import BlueH3 from '@site/src/components/BlueH3'


## Overview {#overview}

OpenRefine offers a number of features to edit and improve the contents of cells automatically and efficiently.
Expand Down Expand Up @@ -57,7 +64,7 @@ You can also convert cells into null values or empty strings. This can be useful

## Fill down and blank down {#fill-down-and-blank-down}

Fill down and blank down are two functions most frequently used when encountering data organized into [records](exploring#row-types-rows-vs-records) - that is, multiple rows associated with one specific entity.
Fill down and blank down are two functions most frequently used when encountering data organized into [records](exploring#rows-vs-records) - that is, multiple rows associated with one specific entity.

If you receive information in rows mode and want to convert it to records mode, the easiest way is to sort your first column by the value that you want to use as a unique records key, [make that sorting permanent](transforming#edit-rows), then blank down all the duplicates in that column. OpenRefine will retain the first unique value and erase the rest. Then you can switch from “Show as rows” to “Show as records” and OpenRefine will associate rows to each other based on the remaining values in the first column.
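
As a rough illustration of what blank down and fill down do to a single sorted column, here is a minimal Python sketch. The function names `blank_down` and `fill_down` are hypothetical helpers for this example, not OpenRefine APIs, and `None` stands in for a blank cell.

```python
from typing import List, Optional


def blank_down(values: List[Optional[str]]) -> List[Optional[str]]:
    """Blank every cell that repeats the value in the cell directly above it."""
    result: List[Optional[str]] = []
    previous: Optional[str] = None
    for value in values:
        is_repeat = value is not None and value == previous
        result.append(None if is_repeat else value)
        previous = value  # compare against the original value, not the blanked one
    return result


def fill_down(values: List[Optional[str]]) -> List[Optional[str]]:
    """Copy the most recent non-blank value into each blank cell below it."""
    result: List[Optional[str]] = []
    last_seen: Optional[str] = None
    for value in values:
        if value is not None:
            last_seen = value
        result.append(last_seen)
    return result


keys = ["a", "a", "b", "b", "b", "c"]
blanked = blank_down(keys)
print(blanked)
print(fill_down(blanked))
```

In this sketch, filling down the blanked column gives back the original column, which is why the two operations are usually described as a pair.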

Expand All @@ -81,7 +88,7 @@ If you have data that should be split into multiple columns instead of multiple

## Join multi-valued cells {#join-multi-valued-cells}

Joining will reverse the “split multi-valued cells” operation, or join up information from multiple rows into one row. All the strings will be compressed into the topmost cell in the record, in the order they appear. A window will appear where you can set the separator; the default is a comma and a space (, ). This separator is optional. We suggest the separator | as a sufficiently rare character.
Joining will reverse the “split multi-valued cells” operation, or join up information from multiple rows into one row. All the strings will be compressed into the topmost cell in the record, in the order they appear. A window will appear where you can set the separator; the default is a comma and a space `, `. This separator is optional. We suggest the pipe separator `|` as a sufficiently rare character.
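
The effect on one column of a single record can be sketched in Python as follows. This is only an illustration under assumptions: `join_multi_valued` is a hypothetical helper, `None` represents a blank cell, and skipping blank cells is a choice made for this example rather than documented behavior.

```python
from typing import List, Optional


def join_multi_valued(cells: List[Optional[str]], separator: str = ", ") -> List[Optional[str]]:
    """Join a record's cells into the topmost cell, keeping their original order."""
    non_blank = [cell for cell in cells if cell not in (None, "")]
    if not non_blank:
        return list(cells)  # nothing to join
    # The joined string goes in the first cell; the remaining cells are blanked.
    return [separator.join(non_blank)] + [None] * (len(cells) - 1)


print(join_multi_valued(["red", "green", "blue"], separator="|"))
```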

## Cluster and edit {#cluster-and-edit}

Expand Down Expand Up @@ -117,15 +124,16 @@ The clustering pop-up window offers you two categories of clustering methods: 6
* [Cologne Phonetic](#cologne-fingerprinting)
* [Daitch-Mokotoff](#daitch-mokotoff)
* [Beider-Morse](#baider-morse)

- [Nearest Neighbor](#nearest-neighbor)
* [Levenshtein](#levenshtein-distance)
* [PPM](#ppm)

#### Key Collision {#key-collision}
<BlueH3 id="key-collision">Key Collision</BlueH3>

**Key collisions** are very fast and can process millions of cells in seconds:

**<a name="fingerprinting">Fingerprinting</a>**
#### Fingerprinting {#fingerprinting}

Fingerprinting is the least likely to produce false positives, so it’s a good place to start. It does the same kind of data cleaning behind the scenes that you might think to do manually:

Expand All @@ -138,57 +146,57 @@ Fingerprinting is the least likely to produce false positives, so it’s a good

For an in-depth understanding of fingerprinting, check this [document](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint)
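
The general idea of a key-collision fingerprint can be sketched in a few lines of Python. This is a simplified approximation, not OpenRefine's implementation: the function name `fingerprint` is hypothetical, and the exact cleaning steps and their order (trim, lowercase, strip accents and ASCII punctuation, sort and de-duplicate tokens) are assumptions made for illustration.

```python
import string
import unicodedata


def fingerprint(value: str) -> str:
    """Build a normalized key so that superficially different values collide."""
    text = value.strip().lower()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop ASCII punctuation.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Split on whitespace, remove duplicate tokens, sort, and rejoin.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


print(fingerprint("Cruise, Tom"))    # cruise tom
print(fingerprint("  Tom Cruise "))  # cruise tom
```

Values that end up with the same key are candidates for the same cluster.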

**<a name="n-gram">N-gram Fingerprinting</a>**
#### N-gram Fingerprinting {#n-gram}

N-gram fingerprinting allows you to set the _n_ value to whatever number you’d like and will create n-grams of _n_ size (after doing some cleaning), alphabetize them, then join them back together into a fingerprint.

**For example**, a 1-gram fingerprint will simply organize all the letters in the cell into alphabetical order - by creating segments one character in length. A 2-gram fingerprint will find all the two-character segments, remove duplicates, alphabetize them, and join them back together (for example, “banana” generates “ba an na an na,” which becomes “anbana”).
For example, a 1-gram fingerprint will simply organize all the letters in the cell into alphabetical order - by creating segments one character in length. A 2-gram fingerprint will find all the two-character segments, remove duplicates, alphabetize them, and join them back together (for example, “banana” generates “ba an na an na,” which becomes “anbana”).

This can help match cells that have typos, or incorrect spaces (such as matching “lookout” and “look out,” which fingerprinting itself won’t identify because it separates words). The higher the _n_ value, the fewer clusters will be identified. With 1-grams, keep an eye out for mismatched values that are near-anagrams of each other (such as “Wellington” and “Elgin Town”).

For an in-depth understanding of N-gram fingerprinting, check this [document](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#n-gram-fingerprint)
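
The n-gram steps described above can be sketched in Python. The function name `ngram_fingerprint` is hypothetical, and the cleaning step (lowercasing and keeping only letters and digits) is an assumption standing in for whatever cleaning OpenRefine actually performs.

```python
def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Create n-grams of size n, remove duplicates, alphabetize, and join them."""
    # Assumed cleaning: lowercase and drop everything that is not a letter or digit.
    cleaned = "".join(ch for ch in value.lower() if ch.isalnum())
    grams = {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}
    return "".join(sorted(grams))


print(ngram_fingerprint("banana", n=2))  # anbana
print(ngram_fingerprint("lookout") == ngram_fingerprint("look out"))  # True
print(ngram_fingerprint("Wellington", n=1) == ngram_fingerprint("Elgin Town", n=1))  # True
```

The last line shows the near-anagram caveat: with 1-grams, two unrelated values produce the same key because they use the same set of letters.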

**<a name="phonetic-clustering">Phonetic Clustering</a>**
#### Phonetic Clustering {#phonetic-clustering}

The next four methods are phonetic algorithms: they identify letters that sound the same when pronounced out loud, and assess text values based on that (such as knowing that a word with an “S” might be a mistype of a word with a “Z”). They are great for spotting mistakes made by not knowing the spelling of a word or name after hearing it spoken aloud.

**<a name="metaphone3-fingerprinting">Metaphone3 Fingerprinting</a>**
#### Metaphone3 Fingerprinting {#metaphone3-fingerprinting}

Metaphone3 fingerprinting is an English-language phonetic algorithm. For example, “Reuben Gevorkiantz” and “Ruben Gevorkyants” share the same phonetic fingerprint in English.

**<a name="cologne-fingerprinting">Cologne Fingerprinting</a>**
#### Cologne Fingerprinting {#cologne-fingerprinting}

Cologne fingerprinting is another phonetic algorithm, but for German pronunciation.

**<a name="daitch-mokotoff">Daitch-Mokotoff</a>**
#### Daitch-Mokotoff {#daitch-mokotoff}

Daitch-Mokotoff is a phonetic algorithm for Slavic and Yiddish words, especially names.

**<a name="baider-morse">Baider-Morse</a>**
#### Beider-Morse {#baider-morse}

Beider-Morse is a version of Daitch-Mokotoff that is slightly more strict.

Regardless of the language of your data, applying each of them might find different potential matches: for example, Metaphone clusters “Cornwall” and “Corn Hill” and “Green Hill,” while Cologne clusters “Greenvale” and “Granville” and “Cornwall” and “Green Wall.”

For an in-depth understanding of phonetics, check this [document](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#phonetic-fingerprint)

#### Nearest Neighbor {#nearest-neighbor}
<BlueH3 id="nearest-neighbor">Nearest Neighbor</BlueH3>

**Nearest Neighbor** clustering methods are slower than key collision methods.
Nearest Neighbor clustering methods are slower than key collision methods.

They allow the user to set a radius - a threshold for matching or not matching. OpenRefine uses a “blocking” method first, which sorts values based on whether they have a certain amount of similarity (the default is “6” for a six-character string of identical characters) and then runs the nearest-neighbor operations on those sorted groups.

We recommend setting the block number to at least 3, and then increasing it if you need to be more strict (for example, if every value with “river” is being matched, you should increase it to 6 or more).

**Note** that bigger block values will take much longer to process, while smaller blocks may miss matches. Increasing the radius will make the matches more lax, as bigger differences will be clustered.

**<a name="levenshtein-distance">Levenshtein Distance</a>**
#### Levenshtein Distance {#levenshtein-distance}

Levenshtein distance counts the number of edits required to make one value perfectly match another. As in the key collision methods above, it will do things like change uppercase to lowercase, fix whitespace, change special characters, etc. Each character that gets changed counts as 1 “distance.” “New York” and “newyork” have an edit distance value of 3 (“N” to “n”; “Y” to “y”; remove the space).

It can do relatively advanced edits, such as understanding the distance between “M. Makeba” and “Miriam Makeba” (5), but it may create false positives if these distances are greater than other, simpler transformations (such as the one-character distance to “B. Makeba,” another person entirely).
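
The distances quoted above can be checked with a standard dynamic-programming implementation of Levenshtein distance. This Python sketch is a textbook version, not OpenRefine's code, and the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("New York", "newyork"))         # 3
print(levenshtein("M. Makeba", "Miriam Makeba"))  # 5
print(levenshtein("M. Makeba", "B. Makeba"))      # 1
```

With a radius of 5, both “Miriam Makeba” and “B. Makeba” would fall within range of “M. Makeba,” which is the false-positive risk described above.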

**<a name="ppm">PPM (Prediction by Partial Matching)</a>**
#### PPM (Prediction by Partial Matching) {#ppm}

PPM (Prediction by Partial Matching) uses compression to see whether two values are similar or different. In practice, this method is very lax even for small radius values and tends to generate many false positives, but because it operates at a sub-character level it is capable of finding substructures that are not easily identifiable by distances that work at the character level. So it should be used as a “last resort” clustering method. It is also more effective on longer strings than on shorter ones.

Expand Down
2 changes: 1 addition & 1 deletion docs/manual/columnediting.md
Original file line number Diff line number Diff line change
Expand Up @@ -121,4 +121,4 @@ Note that for Mac users and for Windows users with the OpenRefine installation w
Every column's <span class="menuItems">Edit column</span> dropdown contains options to move it (to the beginning, end, left, or right), rename it, and delete it.
These operations can be undone, but a removed column cannot be restored later if you keep modifying your data. If you wish to temporarily hide a column, go to <span class="menuItems">[View](sortview#view)</span> → <span class="menuItems">Collapse this column</span> instead.

Be cautious about moving columns in [records mode](cellediting#rows-vs-records): if you change the first column in your dataset (the key column), your records may change in unintended ways.
Be cautious about moving columns in [records mode](exploring#rows-vs-records): if you change the first column in your dataset (the key column), your records may change in unintended ways.
4 changes: 2 additions & 2 deletions docs/manual/exploring.md
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,7 @@ To transform data from one type to another, see [Transforming data](cellediting#

### Dates {#dates}

A “date” type is created when a column is [transformed into dates](transforming#to-date), when an expression is used to [convert cells to dates](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-) or when individual cells are set to have the data type “date”.
A “date” type is created when a column is [transformed](transforming) into dates, when an expression is used to [convert cells to dates](grelfunctions#todateo-b-monthfirst-s-format1-s-format2-) or when individual cells are set to have the data type “date”.

Date-formatted data in OpenRefine relies on a number of conversion tools and standards. For something to be considered a date in OpenRefine, it will be converted into the ISO-8601-compliant extended format with time in UTC: YYYY-MM-DDTHH:MM:SSZ.
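
To make the target format concrete, here is a small Python sketch that renders a timezone-aware timestamp in that form. It only illustrates the format itself; the helper name `to_utc_iso` is hypothetical, and this is not how OpenRefine performs the conversion internally.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(moment: datetime) -> str:
    """Render a timezone-aware datetime as YYYY-MM-DDTHH:MM:SSZ in UTC."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# 12:30 at UTC+02:00 is 10:30 in UTC.
local = datetime(2021, 3, 4, 12, 30, 0, tzinfo=timezone(timedelta(hours=2)))
print(to_utc_iso(local))  # 2021-03-04T10:30:00Z
```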

Expand Down Expand Up @@ -111,7 +111,7 @@ Once you are in records mode, you can still move some columns around, but if you

OpenRefine assigns a unique key behind the scenes, so your records don’t need a unique identifier in the key column. You can keep track of which rows are assigned to each record by the record number that appears under the <span class="menuItems">All</span> column.

To [split multi-valued cells](transforming#split-multi-valued-cells) and apply other operations that take advantage of records mode, see [Transforming data](transforming).
To [split multi-valued cells](cellediting#split-multi-valued-cells) and apply other operations that take advantage of records mode, see [Transforming data](transforming).

Be careful when in records mode that you do not accidentally delete rows based on being blank in one column where there is a value in another.

Expand Down
2 changes: 1 addition & 1 deletion docs/manual/exporting.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ To export data from a project, click the <span class="menuItems">Export</span> d
* Open Document Format (ODF) spreadsheet (ODS)
* Upload to Google Sheets (requires [Google account authorization](starting#google-sheet-from-drive))
* [Custom tabular exporter](#custom-tabular-exporter)
* [SQL statement exporter](#sql-statement-exporter)
* [SQL statement exporter](#sql-exporter)
* [Templating exporter](#templating-exporter), which generates JSON by default

You can also export reconciled data to Wikidata, or export your Wikidata schema for future use with other OpenRefine projects:
Expand Down
22 changes: 13 additions & 9 deletions docs/manual/expressions.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,7 @@ When you select a function that accepts expressions, you will see a window overl

The expressions editor offers you a field for entering your formula and shows you a preview of its transformation on your first few rows of cells.

There is a dropdown menu from which you can choose an expression language. The default at first is GREL; if you begin working with another language, that selection will persist across OpenRefine. Jython and Clojure are also offered with the installation package, and you may be able to add more language support with third-party extensions and customizations.
There is a dropdown menu from which you can choose an expression language. The default at first is [GREL](grel); if you begin working with another language, that selection will persist across OpenRefine. [Jython](jythonclojure#jython) and [Clojure](jythonclojure#clojure) are also offered with the installation package, and you may be able to add more language support with third-party extensions and customizations.

There are also tabs for:
* <span class="tabLabels">History</span>, which shows you formulas you’ve recently used from across all your projects
Expand All @@ -86,9 +86,13 @@ To write a regular expression inside a GREL expression, wrap it between a pair o
value.replace(/\s+/, " ")
```

the regular expression is `\s+`, and the syntax used in the expression wraps it with forward slashes (`/\s+/`). Though the regular expression syntax in OpenRefine follows that of Java (normally in Java, you would write regex as a string and escape it like "\\s+"), a regular expression within a GREL expression is similar to Javascript.
the regular expression is `\s+`, and the syntax used in the expression wraps it with forward slashes (`/\s+/`).

:::info
The regular expression syntax in OpenRefine follows that of Java. Normally in Java, you would write a regex as a string and escape each backslash with a second one, like `\\s+`. A regular expression within a GREL expression uses syntax similar to JavaScript.

Do not use slashes to wrap regular expressions outside of a GREL expression.
:::
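
For comparison, the same whitespace cleanup can be written outside GREL. The Python sketch below is only an analogy (OpenRefine's own regex engine is Java's, as noted above), and `collapse_whitespace` is a hypothetical helper name. It also shows the escaping point: in an ordinary string literal the backslash must be doubled, while a raw string lets you write the pattern as-is.

```python
import re


def collapse_whitespace(value: str) -> str:
    """Replace each run of whitespace with a single space."""
    return re.sub(r"\s+", " ", value)


# An escaped ordinary string and a raw string describe the same pattern.
print("\\s+" == r"\s+")                     # True
print(collapse_whitespace("a  b\t\tc\nd"))  # a b c d
```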

On the [GREL functions](grelfunctions) page, functions that support regex will indicate that with a “p” for “pattern.” The GREL functions that support regex are:
* [contains](grelfunctions#containss-sub-or-p)
Expand Down Expand Up @@ -123,11 +127,11 @@ Most OpenRefine variables have attributes: aspects of the variables that can be
|Variable |Meaning |
|-|-|
| `value` | The value of the cell in the current column of the current row (can be null) |
| `row` | The current row |
| `row.record` | One or more rows grouped together to form a record |
| `cells` | The cells of the current row, with fields that correspond to the column names (or row.cells) |
| `cell` | The cell in the current column of the current row, containing value and other attributes |
| `cell.recon` | The cell's reconciliation information returned from a reconciliation service or provider |
| [`row`](#row) | The current row |
| [`row.record`](#record) | One or more rows grouped together to form a record |
| [`cells`](#cells) | The cells of the current row, with fields that correspond to the column names (or row.cells) |
| [`cell`](#cell) | The cell in the current column of the current row, containing value and other attributes |
| [`cell.recon`](#reconciliation) | The cell's reconciliation information returned from a reconciliation service or provider |
| `rowIndex` | The index value of the current row (the first row is 0) |
| `columnName` | The name of the current cell's column, as a string |

Expand All @@ -142,7 +146,7 @@ The `row` variable itself is best used to access its member fields, which you ca
| `row.columnNames` | An array of the column names of the project. This will report all columns, even those with null cell values in that particular row. Call a column by number with `row.columnNames[3]` |
| `row.starred` | A boolean indicating if the row is starred |
| `row.flagged` | A boolean indicating if the row is flagged |
| `row.record` | The [record](#record) object containing the current row |
| [`row.record`](#record) | The [record](#record) object containing the current row |

For array objects such as `row.columnNames` you can preview the array using the expressions window, and output it as a string using `toString(row.columnNames)` or with something like:

Expand All @@ -164,7 +168,7 @@ You can use `cell` on its own in the expressions editor to copy all the contents
|-|-|-|
| `cell` | An object containing the entire contents of the cell | .value, .recon, .errorMessage |
| `cell.value` | The value in the cell, which can be a string, a number, a boolean, null, or an error | |
| `cell.recon` | An object encapsulating reconciliation results for that cell | See the [reconciliation](expressions#reconciliation) section |
| [`cell.recon`](#reconciliation) | An object encapsulating reconciliation results for that cell | See the [reconciliation](expressions#reconciliation) section |
| `cell.errorMessage` | Returns the message of an *EvalError* instead of the error object itself (use value to return the error object) | .value |

### Reconciliation {#reconciliation}
Expand Down