Random stuff on encoding #1

ChrisMuir opened this issue Jul 25, 2018 · 2 comments

Comments

@ChrisMuir

Just came across this repo, it's 👍

I work with Chinese data in R and Python, and have struggled often with encoding issues in R, so figured I'd share some of the things I've come across and learned in the process. Most of the stuff below is related to file system functions and identifying files. Feel free to add any of this to the repo if you'd like (or not, it's all good).

list.files() versus Sys.glob()

On my machine (Windows, English locale; see the system info below), list.files() fails to preserve Chinese chars in file names, while Sys.glob() preserves them. Here's an example:

# Create a new file with Chinese chars in the file name.
temp_dir <- tempdir()
file_conn <- file(paste0(temp_dir, "/假文件名.txt"))
writeLines("cats", file_conn)
close(file_conn)

# :-(
# `pattern` is a regular expression, not a glob, so match the extension explicitly.
files <- list.files(temp_dir, pattern = "\\.txt$")
files
#> [1] "????.txt"

# :-)
files <- Sys.glob(file.path(temp_dir, "*.txt"))
# basename() drops the directory part regardless of how deep the path is.
basename(files)
#> [1] "假文件名.txt"

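# Clean-up: remove the file(s) found above. unlink() on a directory does
# nothing without recursive = TRUE, and the session temp dir should be kept.
unlink(files)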
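
For reuse, the Sys.glob() approach above can be wrapped in a small helper. This is only a sketch: the name `glob_files` and its arguments are made up for illustration, and the demo uses ASCII file names so that it runs the same way in any locale. Whether Chinese names survive still depends on the platform and locale, as in the example above.

```r
# Hypothetical helper (not from any package): list files with a given
# extension via Sys.glob() and return only the file names.
glob_files <- function(dir, ext = "txt") {
  paths <- Sys.glob(file.path(dir, paste0("*.", ext)))
  basename(paths)
}

# Small demo in a throwaway sub-directory of the session temp dir.
demo_dir <- file.path(tempdir(), "glob_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("a.txt", "b.txt", "c.csv"))))

glob_files(demo_dir)          # file names ending in .txt
glob_files(demo_dir, "csv")   # file names ending in .csv

# Remove the demo directory and its contents.
unlink(demo_dir, recursive = TRUE)
```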

Issue with Chinese Parenthesis Chars in File Names

library(fs)

# This is a file that exists in my current working directory, it contains 
# Chinese parenthesis.
file_name <- "假文件名(12家).txt"

# Checking if it exists fails.
file.exists(file_name)
#> [1] FALSE

fs::is_file(file_name)
#> 假文件名(12家).txt 
#>                FALSE

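## Issue is that the string stored in "file_name" ends up holding English
## (ASCII) paren chars, code points 40 and 41 (see the utf8ToInt() output
## further down), while the file on disk uses Chinese (fullwidth) paren
## chars, code points 65288 and 65289.
## Use conversion between utf8 and int to facilitate ID'ing the file.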

# Function that takes a string containing English parenthesis chars and
# replaces them with Chinese parenthesis chars (via their int code points).
cn_paren <- function(x) {
  x_int <- utf8ToInt(x)
  x_int[x_int == 40] <- 65288
  x_int[x_int == 41] <- 65289
  intToUtf8(x_int)
}

file_name_cn <- cn_paren(file_name)


# Test for the existence, now works.
file.exists(file_name_cn)
#> [1] TRUE

fs::is_file(file_name_cn)
#> 假文件名(12家).txt 
#>                     TRUE


# Can see a slight difference in the parenthesis chars in the two strings 
# when printed.
file_name
#> [1] "假文件名(12家).txt"

file_name_cn
#> [1] "假文件名(12家).txt"


# We also see the difference when using utf8ToInt()
utf8ToInt(file_name)
#> [1] 20551 25991 20214 21517    40    49    50 23478    41    46   116   120   116

utf8ToInt(file_name_cn)
#> [1] 20551 25991 20214 21517 65288    49    50 23478 65289    46   116   120   116
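
The cn_paren() function above handles one string at a time. A possible vectorised variant, plus a lookup that tries both spellings of a path, might look like the sketch below. The names `to_fullwidth_paren` and `find_existing` are hypothetical, and the code assumes the only mismatch is ASCII versus fullwidth parentheses.

```r
# Hypothetical vectorised variant of cn_paren(): swap ASCII parens for
# fullwidth ones in every element of a character vector.
to_fullwidth_paren <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(enc2utf8(s))
    cp[cp == 40L] <- 65288L  # "(" -> U+FF08
    cp[cp == 41L] <- 65289L  # ")" -> U+FF09
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical lookup: try the path as given, then the fullwidth-paren
# spelling, and return whichever candidates exist on disk.
find_existing <- function(path) {
  candidates <- unique(c(path, to_fullwidth_paren(path)))
  candidates[file.exists(candidates)]
}

# Code points of the converted string show the swapped parens.
utf8ToInt(to_fullwidth_paren("a(1).txt"))
```

Trying the original spelling first keeps the helper harmless for files that really do use ASCII parentheses.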

Package fs

The fs package is great; there have been a few times where it was able to ID a file on my PC for which the base functions failed. I often use fs::is_file() in place of base::file.exists(), and fs::file_copy() in place of base::file.copy().

This Kevin Ushey Blog Post

This blog post by Kevin Ushey on string encoding in R is fantastic (and the comments are full of info as well).

System Info

And here's my system/locale info

Sys.getlocale()
#> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

getOption("encoding")
#> [1] "native.enc"

sessionInfo()
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows >= 8 x64 (build 9200)

#> Matrix products: default

#> locale:
#> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252

#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     

#> loaded via a namespace (and not attached):
#> [1] compiler_3.5.0 tools_3.5.0    fs_1.2.3       yaml_2.1.19    Rcpp_0.12.17
@BruceZhaoR
Owner

@ChrisMuir many thanks!

I already knew about the fs package, which is built on the libuv library. Maybe you already know the utf8 package.

R encoding is a complex problem, though. Your list.files() code works fine on my computer, because my native encoding is a Chinese one (code page 936). I'd advise you to read each function's parameters to find out whether there is an encoding parameter, and set it to a specific encoding such as UTF-8 or GB18030/GBK. That may avoid some encoding problems.
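
As one illustration of an encoding parameter, readLines() accepts an `encoding` argument that marks the strings it returns as being in that encoding. The sketch below is an assumption about one workable pattern, not the only one: it writes UTF-8 bytes with useBytes = TRUE so the text is not re-encoded through the native locale, then reads the file back declaring UTF-8. The temp file is only for the demo.

```r
# Text to round-trip, written with \u escapes so the source stays ASCII.
txt <- "\u5047\u6587\u4ef6\u540d"   # the four chars of "假文件名"

# Write the UTF-8 bytes as-is; useBytes = TRUE avoids conversion to the
# native encoding on the way out.
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeLines(enc2utf8(txt), con, useBytes = TRUE)
close(con)

# Read back, declaring the input as UTF-8.
back <- readLines(path, encoding = "UTF-8")

# Compare code points rather than printed output, since printing
# depends on the console and locale.
identical(utf8ToInt(back), utf8ToInt(txt))

unlink(path)
```

For files in GB18030/GBK, a connection opened with file(path, encoding = "GB18030") or the fileEncoding argument of read.csv() plays a similar role.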

I will add more examples to solve the Chinese Encoding problems when I am not busy.

BTW, I found your geolocChina repo. I might use a map API to do that. For more info you can visit:
https://lbs.amap.com/api/webservice/guide/api/georegeo/#geo. But it's not as convenient as yours.

> Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"
> l10n_info()
$`MBCS`
[1] TRUE

$`UTF-8`
[1] FALSE

$`Latin-1`
[1] FALSE

$codepage
[1] 936

> Sys.localeconv()
    decimal_point     thousands_sep          grouping   int_curr_symbol   currency_symbol 
              "."                ""                ""             "CNY"              "" 
mon_decimal_point mon_thousands_sep      mon_grouping     positive_sign     negative_sign 
              "."               ","            "\003"                ""               "-" 
  int_frac_digits       frac_digits     p_cs_precedes    p_sep_by_space     n_cs_precedes 
              "2"               "2"               "1"               "0"               "1" 
   n_sep_by_space       p_sign_posn       n_sign_posn 
              "0"               "4"               "4" 

@ChrisMuir
Author

Hi Bruce, thanks for the reply!

Yeah, I should have mentioned that both of my code examples are specific to an English locale.
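
Since the behaviour depends on the locale, a quick check of the session can help when comparing results between machines. This is a minimal sketch; `locale_report` is a made-up name, and the `codepage` element of l10n_info() is only present on Windows, so it falls back to NA elsewhere.

```r
# Hypothetical helper: summarise the locale settings that matter here.
locale_report <- function() {
  info <- l10n_info()
  cp <- info[["codepage"]]   # NULL on non-Windows platforms
  list(
    ctype    = Sys.getlocale("LC_CTYPE"),
    utf8     = isTRUE(info[["UTF-8"]]),
    codepage = if (is.null(cp)) NA_integer_ else as.integer(cp)
  )
}

str(locale_report())
```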

Ah, I've never used lbs amap, that's cool. I use the Baidu Maps API (via baidumap R package), for situations in which the geolocChina functions fail. It's not ideal though, since there's a daily limit, and getting county/city/province info for a string requires two API calls (the first to get lat/lon, the second takes in the lat/lon and returns location info).

Cheers!
