Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CharsetAutoSelector (#222) #245

Closed
Closed
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions euphony/src/main/cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ set(EUPHONY_SRC
core/charset/UTF16Charset.cpp
core/charset/UTF32Charset.cpp
core/charset/DefaultCharset.cpp
core/charset/CharsetAutoSelector.cpp
core/base/Base2.cpp
core/base/Base16.cpp
core/base/Base32.cpp
Expand Down
29 changes: 29 additions & 0 deletions euphony/src/main/cpp/core/charset/CharsetAutoSelector.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
#include "CharsetAutoSelector.h"
#include "ASCIICharset.h"
#include "DefaultCharset.h"
#include "UTF8Charset.h"
#include "UTF16Charset.h"
#include "UTF32Charset.h"
#include <sstream>
#include <iomanip>
#include <iostream>
#include <algorithm>

using namespace Euphony;

HexVector CharsetAutoSelector::select(std::string src) {

HexVector asciiResult = ASCIICharset().encode(src);
HexVector defaultResult = DefaultCharset().encode(src);
HexVector utf8Result = UTF8Charset().encode(src);
HexVector utf16Result = UTF16Charset().encode(src);
HexVector utf32Result = UTF32Charset().encode(src);

HexVector results[] = {asciiResult, defaultResult, utf8Result, utf16Result, utf32Result};

std::sort(results, results+5, [](HexVector& left, HexVector& right) {
return left.getSize() < right.getSize();
});

return results[0];
}
Comment on lines +14 to +29
Copy link
Member

@designe designe Sep 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job to make select function
It'll make a shortest encoded data from sort.

Seeing this result, I thought it would be good to implement the Charset interface called AutoCharset.

class AutoCharset : public Charset {
  // encode & decode implemenation
}

However, what I expected is that select is gonna tell us type of charset.
this is utility for charset selection. so, function definition might be like below.
your proposal is also this one #222

Chraset* ChrasetAutoSelector::select(std::string src);

or

shared_ptr<Charset> CharsetAutoSelector::select(std::string src);

what do you think about it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. It has been revised and I will commit soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DahyeonWoo
Here is my thinking.
There are so many methods to implement CharsetAutoSelector :)

Firstly, analyze the charset's charecteristics.

  • ASCII is excluded because UTF-8 is included
  • UTF32 is unconditionally 4 bytes, except that it always returns the maximum length.
  • All that is left is to compare UTF-8 and UTF-16
  • If there are only Korean characters, UTF16 is short (2 bytes per Korean character)
  • If there is a lot of English and sometimes a little bit of Korean, UTF8 is short (because English is 1 byte and Korean is 3 bytes)

My pseudo code :)

std::shared_ptr<Charset> CharsetAutoSelector::select(std::string src) {
    std::shared_ptr<Charset> utf8Charset = make_shared<UTF8Charset>();
    std::shared_ptr<Charset> utf16Charset = make_shared<UTF16Charset>();

    return (utf8Charset->encode(src) > utf16Charset->encode(src) ? utc16Charset : utf8Charset;
}

Copy link
Member

@designe designe Sep 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And.. if return type is CharsetType,
we can use like below.

CharsetType CharsetAutoSelector::select(std::string src) {
    return (UTF8Charset().encode(src) > UTF16Charset().encode(src)) ? CharsetType::UTF16 : CharsetType::UTF8;
}

CharsetType is defined in Definition.h

enum class CharsetType : int32_t {

Which one do you like?
if you have some question, let me know :)

16 changes: 16 additions & 0 deletions euphony/src/main/cpp/core/charset/CharsetAutoSelector.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
#ifndef EUPHONY_CHARSETAUTOSELECTOR_H
#define EUPHONY_CHARSETAUTOSELECTOR_H

#include "Charset.h"

namespace Euphony {

class CharsetAutoSelector : public Charset {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why you implemented Charset interface here?
I couldn't find implemented encode & decode functions.

public:
CharsetAutoSelector() = default;
~CharsetAutoSelector() = default;
HexVector select(std::string src);
};
}

#endif //EUPHONY_CHARSETAUTOSELECTOR_H
1 change: 1 addition & 0 deletions euphony/src/main/cpp/tests/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ add_executable(
charset/utf8CharsetTest.cpp
charset/utf16CharsetTest.cpp
charset/utf32CharsetTest.cpp
charset/charsetAutoSelectorTest.cpp
base/base2Test.cpp
base/base16Test.cpp
base/base32Test.cpp
Expand Down
81 changes: 81 additions & 0 deletions euphony/src/main/cpp/tests/charset/charsetAutoSelectorTest.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
#include <gtest/gtest.h>
#include <Definitions.h>
#include <tuple>
#include <charset/ASCIICharset.h>
#include <charset/DefaultCharset.h>
#include <charset/UTF8Charset.h>
#include <charset/UTF16Charset.h>
#include <charset/UTF32Charset.h>
#include <iostream>
#include <algorithm>

using namespace Euphony;

typedef std::tuple<std::string, std::string> TestParamType;

class CharsetAutoSelectorTestFixture : public ::testing::TestWithParam<TestParamType> {

public:
void openCharset() {
EXPECT_EQ(asciiCharset, nullptr);
asciiCharset = new ASCIICharset();
ASSERT_NE(asciiCharset, nullptr);

EXPECT_EQ(defaultCharset, nullptr);
defaultCharset = new DefaultCharset();
ASSERT_NE(defaultCharset, nullptr);

EXPECT_EQ(utf8Charset, nullptr);
utf8Charset = new UTF8Charset();
ASSERT_NE(utf8Charset, nullptr);

EXPECT_EQ(utf16Charset, nullptr);
utf16Charset = new UTF16Charset();
ASSERT_NE(utf16Charset, nullptr);

EXPECT_EQ(utf32Charset, nullptr);
utf32Charset = new UTF16Charset();
ASSERT_NE(utf32Charset, nullptr);
}

Charset *asciiCharset = nullptr;
Charset *defaultCharset = nullptr;
Charset *utf8Charset = nullptr;
Charset *utf16Charset = nullptr;
Charset *utf32Charset = nullptr;
};

TEST_P(CharsetAutoSelectorTestFixture, SelectTest) {
openCharset();

std::string source;
std::string expectedResult;

std::tie(source, expectedResult) = GetParam();

HexVector asciiResult = asciiCharset->encode(source);
HexVector defaultResult = defaultCharset->encode(source);
HexVector utf8Result = utf8Charset->encode(source);
HexVector utf16Result = utf16Charset->encode(source);
HexVector utf32Result = utf32Charset->encode(source);

HexVector results[] = {asciiResult, defaultResult, utf8Result, utf16Result, utf32Result};

std::sort(results, results+5, [](HexVector& left, HexVector& right) {
return left.getSize() < right.getSize();
});

HexVector actualResult = results[0];

EXPECT_EQ(actualResult.toString(), expectedResult);
}

INSTANTIATE_TEST_CASE_P(
CharsetAutoSelectorTestSuite,
CharsetAutoSelectorTestFixture,
::testing::Values(
TestParamType("a", "61"),
TestParamType("b", "62"),
TestParamType("가", "eab080"),
TestParamType("나", "eb8298")
));