-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CharsetAutoSelector (#222) #245
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DahyeonWoo It's great! Your CharsetAutoSelector
returns the shortest encoded string.
However, it would be more helpful if it return Charset
interface, not encoded string, because user want to know how this string is encoded and utilize charset. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi! @DahyeonWoo
Thanks to implement CharsetAutoSelector
feature :)
I have some question & review here :)
|
||
namespace Euphony { | ||
|
||
class CharsetAutoSelector : public Charset { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why you implemented Charset
interface here?
I couldn't find implemented encode
& decode
functions.
HexVector CharsetAutoSelector::select(std::string src) { | ||
|
||
HexVector asciiResult = ASCIICharset().encode(src); | ||
HexVector defaultResult = DefaultCharset().encode(src); | ||
HexVector utf8Result = UTF8Charset().encode(src); | ||
HexVector utf16Result = UTF16Charset().encode(src); | ||
HexVector utf32Result = UTF32Charset().encode(src); | ||
|
||
HexVector results[] = {asciiResult, defaultResult, utf8Result, utf16Result, utf32Result}; | ||
|
||
std::sort(results, results+5, [](HexVector& left, HexVector& right) { | ||
return left.getSize() < right.getSize(); | ||
}); | ||
|
||
return results[0]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job to make select
function
It'll make a shortest encoded data from sort
.
Seeing this result, I thought it would be good to implement the Charset interface called AutoCharset.
class AutoCharset : public Charset {
// encode & decode implemenation
}
However, what I expected is that select
is gonna tell us type of charset.
this is utility for charset selection. so, function definition might be like below.
your proposal is also this one #222
Chraset* ChrasetAutoSelector::select(std::string src);
or
shared_ptr<Charset> CharsetAutoSelector::select(std::string src);
what do you think about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. It has been revised and I will commit soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DahyeonWoo
Here is my thinking.
There are so many methods to implement CharsetAutoSelector
:)
Firstly, analyze the charset's charecteristics.
- ASCII is excluded because UTF-8 is included
- UTF32 is unconditionally 4 bytes, except that it always returns the maximum length.
- All that is left is to compare UTF-8 and UTF-16
- If there are only Korean characters, UTF16 is short (2 bytes per Korean character)
- If there is a lot of English and sometimes a little bit of Korean, UTF8 is short (because English is 1 byte and Korean is 3 bytes)
My pseudo code :)
std::shared_ptr<Charset> CharsetAutoSelector::select(std::string src) {
std::shared_ptr<Charset> utf8Charset = make_shared<UTF8Charset>();
std::shared_ptr<Charset> utf16Charset = make_shared<UTF16Charset>();
return (utf8Charset->encode(src) > utf16Charset->encode(src) ? utc16Charset : utf8Charset;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And.. if return type is CharsetType
,
we can use like below.
CharsetType CharsetAutoSelector::select(std::string src) {
return (UTF8Charset().encode(src) > UTF16Charset().encode(src)) ? CharsetType::UTF16 : CharsetType::UTF8;
}
CharsetType
is defined in Definition.h
enum class CharsetType : int32_t { |
Which one do you like?
if you have some question, let me know :)
@DahyeonWoo Have any problem? We're waiting your updates 👀 |
Feature