group by tag in csv export
raphaellaude committed Nov 3, 2023
1 parent 0a56c93 commit 7ba3cfc
Showing 3 changed files with 29 additions and 4 deletions.
10 changes: 8 additions & 2 deletions README.md
@@ -4,14 +4,14 @@ US Addrs is a rust crate for parsing unstructured United States address strings
It is a rust implementation of the awesome [usaddress](https://github.com/datamade/usaddress/tree/master) library.
Thank you to the folks at [datamade](https://datamade.us/) for releasing such a cool tool.

US Addrs is currently _79% (~5x) faster_ than usaddress, though additional optimizations should be possible. Accuracy stats TK.
US Addrs is currently _79% (~5x) faster_ than usaddress, though additional optimizations should be possible. [Accuracy](#accuracy) is close to usaddress but not quite matching yet.
The goal of this implementation is to facilitate use cases requiring better performance, such as geocoding large batches of addresses.

:warning: This crate is under **active development** and may not match usaddress's accuracy. US Addrs will be better tested / documented shortly.

## Examples

US Addrs can be run from the command line
US Addrs can be run from the command line to parse an address

```bash
cargo run -- parse --address '33 Nassau Avenue, Brooklyn, NY'
@@ -20,6 +20,12 @@ cargo run -- parse --address '33 Nassau Avenue, Brooklyn, NY'
[("33", "AddressNumber"), ("Nassau", "StreetName"), ("Avenue", "StreetNamePostType"), ("Brooklyn", "PlaceName"), ("NY", "StateName")]
```

or export a list of addresses to CSV

```bash
cargo run -- parse-file --file-path tests/test_data/test_addrs.txt test.csv
```

or by importing the crate and using the `parse` function

```rust
18 changes: 18 additions & 0 deletions src/lib.rs
@@ -324,3 +324,21 @@ pub fn read_xml_tagged_addresses(file_path: &str) -> (Vec<String>, Vec<Vec<Strin

    (addresses, tags)
}

pub fn group_by_tag(tokens: Vec<(String, String)>) -> Vec<(String, String)> {
    let mut result = Vec::new();
    let mut tokens = tokens.into_iter().peekable();

    while let Some((mut token, tag)) = tokens.next() {
        while tokens
            .peek()
            .map_or(false, |(_, ref next_tag)| &tag == next_tag)
        {
            let (next_token, _) = tokens.next().unwrap();
            token = format!("{} {}", token, next_token);
        }
        result.push((token, tag));
    }

    result
}
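
To illustrate what `group_by_tag` does, here is a minimal, self-contained sketch. It is a behavior-equivalent variant of the function in this diff, written with `Peekable::next_if` rather than `peek` plus `unwrap`; it is not the code the commit adds. The `tagged` helper and the token list are hypothetical, hand-written for the example, and are not real parser output.

```rust
/// Merge runs of consecutive tokens that share a tag into a single
/// space-separated token. Variant of the `group_by_tag` in the diff,
/// using `Peekable::next_if` (an assumption of this sketch, not the commit's code).
fn group_by_tag(tokens: Vec<(String, String)>) -> Vec<(String, String)> {
    let mut result = Vec::new();
    let mut tokens = tokens.into_iter().peekable();

    while let Some((mut token, tag)) = tokens.next() {
        // Keep consuming tokens while the next one carries the same tag.
        while let Some((next_token, _)) = tokens.next_if(|(_, next_tag)| *next_tag == tag) {
            token.push(' ');
            token.push_str(&next_token);
        }
        result.push((token, tag));
    }

    result
}

/// Hypothetical helper: build owned (token, tag) pairs from string literals.
fn tagged(pairs: &[(&str, &str)]) -> Vec<(String, String)> {
    pairs
        .iter()
        .map(|&(token, tag)| (token.to_string(), tag.to_string()))
        .collect()
}

fn main() {
    // Illustrative input only; the tags mirror the README example.
    let tokens = tagged(&[
        ("33", "AddressNumber"),
        ("Nassau", "StreetName"),
        ("Avenue", "StreetNamePostType"),
        ("New", "PlaceName"),
        ("York", "PlaceName"),
        ("NY", "StateName"),
    ]);

    for (token, tag) in group_by_tag(tokens) {
        println!("{}: {}", tag, token);
    }
}
```

Only adjacent tokens are merged: two tokens with the same tag that are separated by a differently tagged token stay as separate entries.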
5 changes: 3 additions & 2 deletions src/main.rs
@@ -1,6 +1,6 @@
use clap::Parser;
use us_addrs::train::train_model;
use us_addrs::{parse, parse_addresses_from_txt, TAGS};
use us_addrs::{group_by_tag, parse, parse_addresses_from_txt, TAGS};

// use std::path::PathBuf;

@@ -44,10 +44,11 @@ fn main() {
wtr.write_record(TAGS.iter()).unwrap();

for tagged_address in parsed_addresses {
let group_tagged_address = group_by_tag(tagged_address);
let mut record = Vec::new();

for tag in TAGS.iter() {
if let Some((token, _)) = tagged_address
if let Some((token, _)) = group_tagged_address
.iter()
.find(|&(_, token_tag)| *token_tag == *tag)
{
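
The hunk above is truncated, but the visible part shows why grouping matters for the CSV export: `find` returns only the first token carrying a given tag, so a multi-word value loses everything after its first word unless adjacent tokens are merged first. The sketch below reproduces that lookup under stated assumptions: `TAGS` here is a hypothetical three-entry subset, `build_record` is a made-up name, and filling a missing tag with an empty string is a guess, since the `else` branch is not shown in the diff.

```rust
// Hypothetical subset of the crate's TAGS constant, for illustration only.
const TAGS: [&str; 3] = ["AddressNumber", "StreetName", "PlaceName"];

/// Build one CSV record: for each tag, take the first token carrying it.
/// Using an empty string for a missing tag is an assumption of this sketch.
fn build_record(tagged_address: &[(String, String)]) -> Vec<String> {
    TAGS.iter()
        .map(|&tag| {
            tagged_address
                .iter()
                .find(|(_, token_tag)| token_tag.as_str() == tag)
                .map(|(token, _)| token.clone())
                .unwrap_or_default()
        })
        .collect()
}

fn main() {
    // Hand-written token lists, not real parser output.
    let ungrouped = vec![
        ("New".to_string(), "PlaceName".to_string()),
        ("York".to_string(), "PlaceName".to_string()),
    ];
    let grouped = vec![("New York".to_string(), "PlaceName".to_string())];

    // Without grouping, only the first PlaceName token reaches the record.
    println!("{:?}", build_record(&ungrouped));
    // With grouping, the full place name is kept.
    println!("{:?}", build_record(&grouped));
}
```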
