Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CharsetAutoSelector (#222) #245

Closed

Conversation

DahyeonWoo
Copy link
Member

Feature

  • Input: desired String value
  • Output: the charset showing the shortest encoding result (Charset interface)
  • Add unit test
  • The document will be contributed separately.

@yeonns yeonns requested review from designe and yeonns September 12, 2022 03:52
Copy link
Member

@yeonns yeonns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DahyeonWoo It's great! Your CharsetAutoSelector returns the shortest encoded string.
However, it would be more helpful if it return Charset interface, not encoded string, because user want to know how this string is encoded and utilize charset. 😄

Copy link
Member

@designe designe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! @DahyeonWoo
Thanks to implement CharsetAutoSelector feature :)

I have some question & review here :)


namespace Euphony {

class CharsetAutoSelector : public Charset {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why you implemented Charset interface here?
I couldn't find implemented encode & decode functions.

Comment on lines +14 to +29
HexVector CharsetAutoSelector::select(std::string src) {

HexVector asciiResult = ASCIICharset().encode(src);
HexVector defaultResult = DefaultCharset().encode(src);
HexVector utf8Result = UTF8Charset().encode(src);
HexVector utf16Result = UTF16Charset().encode(src);
HexVector utf32Result = UTF32Charset().encode(src);

HexVector results[] = {asciiResult, defaultResult, utf8Result, utf16Result, utf32Result};

std::sort(results, results+5, [](HexVector& left, HexVector& right) {
return left.getSize() < right.getSize();
});

return results[0];
}
Copy link
Member

@designe designe Sep 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job to make select function
It'll make a shortest encoded data from sort.

Seeing this result, I thought it would be good to implement the Charset interface called AutoCharset.

class AutoCharset : public Charset {
  // encode & decode implemenation
}

However, what I expected is that select is gonna tell us type of charset.
this is utility for charset selection. so, function definition might be like below.
your proposal is also this one #222

Chraset* ChrasetAutoSelector::select(std::string src);

or

shared_ptr<Charset> CharsetAutoSelector::select(std::string src);

what do you think about it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. It has been revised and I will commit soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DahyeonWoo
Here is my thinking.
There are so many methods to implement CharsetAutoSelector :)

Firstly, analyze the charset's charecteristics.

  • ASCII is excluded because UTF-8 is included
  • UTF32 is unconditionally 4 bytes, except that it always returns the maximum length.
  • All that is left is to compare UTF-8 and UTF-16
  • If there are only Korean characters, UTF16 is short (2 bytes per Korean character)
  • If there is a lot of English and sometimes a little bit of Korean, UTF8 is short (because English is 1 byte and Korean is 3 bytes)

My pseudo code :)

std::shared_ptr<Charset> CharsetAutoSelector::select(std::string src) {
    std::shared_ptr<Charset> utf8Charset = make_shared<UTF8Charset>();
    std::shared_ptr<Charset> utf16Charset = make_shared<UTF16Charset>();

    return (utf8Charset->encode(src) > utf16Charset->encode(src) ? utc16Charset : utf8Charset;
}

Copy link
Member

@designe designe Sep 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And.. if return type is CharsetType,
we can use like below.

CharsetType CharsetAutoSelector::select(std::string src) {
    return (UTF8Charset().encode(src) > UTF16Charset().encode(src)) ? CharsetType::UTF16 : CharsetType::UTF8;
}

CharsetType is defined in Definition.h

enum class CharsetType : int32_t {

Which one do you like?
if you have some question, let me know :)

@designe designe added the feature Euphony's new feature label Sep 13, 2022
@yeonns
Copy link
Member

yeonns commented Sep 15, 2022

@DahyeonWoo Have any problem? We're waiting your updates 👀

@designe designe added the osca-22 Contribution From Open Source Academy 2022 label Sep 19, 2022
@DahyeonWoo DahyeonWoo closed this Jul 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Euphony's new feature osca-22 Contribution From Open Source Academy 2022
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Suggested: CharacterSet Auto Selector
3 participants