Releases: supabase-community/copycat

v0.18.1

02 Nov 14:02

⚠ BREAKING CHANGES

Performance improvements

In order to always return the same output for the same input, copycat works by hashing the given input value to a number, and then using the resulting number to compute the output value. Under the hood, we use SipHash to do this (more context here).

Sometimes, though, we need a sequence of numbers to compute an output rather than a single number. A good example is copycat.scramble(), which needs one number for each character in the input. In these cases, we used SipHash to compute the initial hash and then, for each number in the sequence, used the faster fnv1a to re-hash the current hash value, obtaining the next number in a deterministic way.
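The re-hashing approach described above can be sketched roughly as follows. This is a minimal illustration, not copycat's actual code: the names `fnv1a32` and `rehashSequence` are hypothetical, the initial hash is assumed to already be a 32-bit number (in copycat it would come from SipHash), and a 32-bit FNV-1a over the four bytes of the current value is assumed.

```typescript
// Hypothetical sketch: 32-bit FNV-1a applied to the 4 bytes of a 32-bit value.
function fnv1a32(value: number): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < 4; i++) {
    hash ^= (value >>> (i * 8)) & 0xff; // mix in one byte
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime (32-bit wraparound)
  }
  return hash >>> 0; // normalise to an unsigned 32-bit integer
}

// Hypothetical sketch: derive `length` numbers by repeatedly re-hashing the current hash.
function rehashSequence(initialHash: number, length: number): number[] {
  const out: number[] = [];
  let current = initialHash >>> 0;
  for (let i = 0; i < length; i++) {
    current = fnv1a32(current); // next number depends only on the previous one
    out.push(current);
  }
  return out;
}

// The same initial hash always yields the same sequence.
console.log(rehashSequence(12345, 3));
```

Each number costs a full hash of the previous one, which is the per-element overhead a dedicated PRNG avoids.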

However, the tool we are really looking for in this case is a seeded, deterministic pseudo-random number generator (PRNG). It turns out there are much faster algorithms for this than fnv1a, which is really a hash function rather than a PRNG. After some experimenting, we settled on splitmix.
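To make the idea of a seeded deterministic PRNG concrete, here is a minimal sketch using a widely circulated 32-bit splitmix variant. It is an assumption that this resembles what copycat uses internally; the constants, the 32-bit state width, and the names `splitmix32` and `sequence` are illustrative only.

```typescript
// Hypothetical sketch of a 32-bit splitmix-style generator.
// Returns a function that yields the next unsigned 32-bit integer on each call.
function splitmix32(seed: number): () => number {
  let state = seed | 0; // keep the state as a 32-bit integer
  return () => {
    state = (state + 0x9e3779b9) | 0; // advance the state by a fixed odd increment
    let t = state ^ (state >>> 16); // mix the state into the output
    t = Math.imul(t, 0x21f0aaad);
    t ^= t >>> 15;
    t = Math.imul(t, 0x735a2d97);
    t ^= t >>> 15;
    return t >>> 0;
  };
}

// Produce `length` numbers from a seed (for example, a hash of the input value).
function sequence(seed: number, length: number): number[] {
  const next = splitmix32(seed);
  return Array.from({ length }, () => next());
}

// The same seed always yields the same sequence.
console.log(sequence(42, 3));
```

The state advances with a single addition and the output is a handful of shifts and multiplies, so each number is cheap compared with hashing the previous value.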

What does this mean for me?

The sequencing functionality described above is a core part of copycat. Any method that relies on it, whether directly or transitively via other copycat methods, is affected. For large outputs, you might see shorter computation times (benchmarks for copycat.scramble() below).

However, it also means that for all affected methods, the same input value now maps to an entirely different output value than in previous releases, since we changed the underlying function used to compute a sequence of numbers from a given hash value. That said, within this release a given input value will always map to the same output value. In other words, copycat is still deterministic; this release only changes the mapping from inputs to outputs.

Benchmark results for copycat.scramble()

{
  "name": "100 words",
  "results": [
    {
      "ms": 0.8084074373484236,
      "name": "copycat.scramble(): before (0.17.3)",
      "ops": 1237,
      "percentSlower": 94.96
    },
    {
      "ms": 0.04073485681697829,
      "name": "copycat.scramble(): after (reimplementation)",
      "ops": 24549,
      "percentSlower": 0
    }
  ]
}
{
  "name": "100000 words",
  "results": [
    {
      "ms": 1000,
      "name": "copycat.scramble(): before (0.17.3)",
      "ops": 1,
      "percentSlower": 93.33
    },
    {
      "ms": 66.66666666666667,
      "name": "copycat.scramble(): after (reimplementation)",
      "ops": 15,
      "percentSlower": 0
    }
  ]
}
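As a reading aid for the figures above: the percentSlower values are consistent with comparing each entry's ops against the fastest entry's ops. The formula below is an assumption (the benchmark tool's exact definition is not stated here), and `percentSlower` is a hypothetical helper, but it reproduces the reported numbers.

```typescript
// Hypothetical helper: how much slower an entry is than the fastest entry,
// as a percentage rounded to two decimals. Assumed formula, not taken from the benchmark tool.
function percentSlower(ops: number, fastestOps: number): number {
  const raw = (1 - ops / fastestOps) * 100;
  return Math.round(raw * 100) / 100;
}

console.log(percentSlower(1237, 24549)); // 94.96 (100 words, before vs. after)
console.log(percentSlower(1, 15)); // 93.33 (100000 words, before vs. after)
```

In terms of throughput, this corresponds to roughly a 20x improvement for 100 words (24549 vs. 1237 ops) and a 15x improvement for 100000 words (15 vs. 1 ops).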

More context here: #45

Lorem ipsum

copycat.word() and the other word methods (copycat.words(), copycat.paragraph() and copycat.sentence()) changed from generating text like this:

Kai ni viramira memayo kayu. Hahyceavi nameta mohy shichiacea menivayu shi mika yokinmu, nahyraki hyka chi niceavi ta. Ta hamevakin yuno hyakova nivami yohycea ko, yoha shiyu miha hy kiko kinyoshi ka ninoshi. Notakimu yo yukake kakekaihy vaceaso vakiso nomu rae, yukin chiraekimo ceavino yo muyo. Hyva memayo shikemavi ka kakesokin mamuhamo kinmukame mora, ranino masochiyo kinoa kesoni mamo. Va nohakin komiva shimo hykikayo makinra yorae, sovami kai raenira raeyo sonavi mo mora chirae

To text like this:

Et modo lucilias legatomnem et. Quis ratio iudicur ut defuitur quod interessar endis, doloria romandum athenisse explicem quia. Expeten quam hoc ex amus ant sive, providintem ad claudicur torquato nes nihil nec ut. Audiri dicerea summum arisset ne exceperem tam si, amartifex doloris nam quae ipsum. Et causa iudicitat extremum endam tota tum antippus, de vidi videbo rerum ut. Affere ab mundi nimium summa partemerror causae, his am semperfruique in sapiens gloriatur et dicenim

Context

The idea behind the way we used to generate words was to use Japanese-like syllables for simplicity, since they compose easily (any syllable can follow any other).
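The "any syllable can follow any other" property is what made the old approach simple: each number in a sequence can pick a syllable independently, with no rules about neighbours. A minimal sketch, assuming a small illustrative syllable list (not copycat's real list) and a hypothetical helper `wordFromNumbers`:

```typescript
// Illustrative syllable list only; the actual list used by earlier releases is not shown here.
const SYLLABLES = ["ka", "ni", "vi", "ra", "mi", "yo", "shi", "cea"];

// Hypothetical helper: map each number to a syllable and concatenate them into one word.
function wordFromNumbers(numbers: number[]): string {
  return numbers
    .map((n) => SYLLABLES[(n >>> 0) % SYLLABLES.length]) // treat n as unsigned, wrap into the list
    .join("");
}

console.log(wordFromNumbers([0, 1, 2])); // "kanivi"
```

Because no syllable constrains the next one, the numbers can come straight from a deterministic sequence without any extra bookkeeping.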

It would be better, though, to use something more standard and expected, such as "lorem ipsum" placeholder text. It is also arguably a better representation of fake text that is meant to be read (e.g. on a web page): since we generate words with Latin characters, Latin words are naturally more representative of where these characters would be seen. In contrast, the previous fake text more closely resembles a romanized form of a language not written with Latin characters, and so is probably less representative of such text.

More context here: #46

v0.1.0

09 May 07:41

This is the first release of Snaplet's Copycat! It includes the most important transformations for personally identifiable information in databases.