-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Benchmarking the design #126
Comments
First I think you should focus on a more simple "schema" for your keys. There is no point in trying to optimize a "bad" schema (except to realize that it is a bad schema!), and my experience is to always start to create a simple version by focusing on the simplest way to encode keys that will give me the ability to perform range reads as much as possible, and then be able to parallelize reads by batch as much as possible. This usually gets you 80% of the way. Then if you have either a lot of data, or a lot of read/writes per second then you can "break" this nice and simple schema into something more involved, to achieve your target performance. Talking about target performance, I don't think that your requirement is to be able to insert a million employees per second, or to guarantee that fetching the list of employees of a single department takes less than 5 milliseconds? If your requirements are more "relaxed" like for a typical internal web portal, where you would be happy with response time in < 200ms, usually you don't need to do more work! -- I think that you should not use sub-directories for each employee, unless if you need to be able to store a lot of data for each employee (like a mini database with multiple "tables", specific to each person), or need very strict "tenant" isolation. Also, it depends on the number of employees, and keys per employee, combined with the expected number of requests per second. Since there is a finite number of humans on the planet, and probably even less working for the same company, I don't think that this dataset will have a high traffic. Probably mostly reading the employee details, and maybe adding/updating a handful of entry per day? If this is the case, then I think you can achieve all of your requirements using a "main" table to store employee details, using a single directory, by using the employee unique ID (depends on your model, maybe an integer? a string? a guid?):
Here the '...' in the tuple refers to the prefix of the main directory that you open, so if you have opened it as This gives your first requirement of being able to get an employee knowing it's ID (just read back the key). Also, the '0' as the first part is here to prepare the way for adding indexes. I usually use 0 for the 'main' content (here the details for each person), and then use 1, 2, etc... for the rest. Note that integer 0 is encoded as a single byte (0x14) with the tuple encoding, while 1, 2, 3 will use two bytes (0x15 ##) so if you know that one "subspace" will have a lot more data than the other, then you could use 0 for this one, and >= 1 for the other, but other than that it's only a convention that you use). In my code I usually use constants to encode these prefixes, in order to give them a nice name. private const int EMPLOYEES_BY_ID = 0; // main table with employee details
private const int EMPLOYEES_BY_DEPARTMENT = 1; // index by department id (non-unique)
private const int EMPLOYEES_BY_EMAIL = 2; // index by email (unique) That way the code to generate keys looks like So now with this, we have all the employee details that are store as in a single range that is contiguous, and has the benefit of being already sorted by employee ID. So if you need to scan through all employees, you need a single range read under -- To handle departments, you need an additional index, so that's where we use the integer prefix constants introduced above. Here we have a non-unique index, meaning multiple employees belong to the same department. The typical pattern for this, is to encode both the indexed value and the document id in the key, and leave the value empty.
In this case, the indexed value is the unique id of the department (could be anything: an integer, string, Guid, but I would try not to use the actual name of the entity: it is prone to typos, casing issues, and also a department could be renamed!). The document id here will be the unique id of the employee, the same that is used in the "main" table (prefix 0). When you range read under To "add" an employee to a department, you simply create the key via To "remove" the employee from a department, you DO NOT forget to properly update the indexes whenever the user is updated! Especially if the user changes department, you have to clear the old key (with previous dept id) and create a new one (with new dept id) otherwise the index will be corrupted! Ideally, you should always update the employee details and the indexes in the same transaction, never split across multiple transactions! Once you have indexed all the employees, you will have a first range under (..., 0) with all employee details, and then next to it a second range of (..., 1) with the index. This index is itself sub-divided into small ranges of all the employees for each department. So if you want to know all the employees of a single department, you just need to If you only need the list of IDs you are done, but if you want the details, you need to do a second set of read, to read all the keys under (...., 0,) for each id in the list. Please don't do a single The best practice is to use If you need to get all employees from a list of one or more departments, then you simply need to do the same thing but for all of them and combine the results. If the number of depts is always a handful, you can do this sequentially (read in the index to fetch the employees id for first, then second, then third, ...) and merge all ids in an -- You can add more indexes, like for example by email, by manager id, by date of hiring, by role, etc... simply add new prefixes (2, 3, 4, ...). For 'unique' indexes (meaning for each indexed value there is only one entity), you can simplify the key format by storing the document id in the value instead of in the key:
This is a little bit more compact, but using the generic "non-unique" key format would work as well. -- Please remember that FoundationDB is very slow if you do everything sequentially (reading a key, awaiting the result, and from that get the name of another key to read, etc...), but very fast if you can combine range reads (read keys that are contiguous in the key space) and parallelized reads (via GetValues or by spawning a lot of tasks in c# and merging with Task.WhenAll). But when you have something working and that handle most cases, then you may want to be able to precisly see all the queries that are performed, and be able to measure them. That's where the embedded profiler comes in. You can easily enable the capture of all requests, and produce a log for each transaction that you can either dump in the console, write to a file, or send to some other service. On each transaction, you can set a "log handler" which is callback that is invoked with an foundationdb-dotnet-client/FoundationDB.Client/DependencyInjection/FdbDatabaseProviderOptions.cs Line 47 in d305e11
To simplify your life, this log object has a convenient method This is also very useful in unit tests, where you can dump the exact queries that failed to the test output (see for example https://github.com/Doxense/foundationdb-dotnet-client/blob/master/FoundationDB.Tests/Layers/DirectoryFacts.cs#L60 ) This method produces outputs like this:
You can add "comments" or "labels" to the trace at any point by calling |
Hi, I have a few designs to implement the same functionality in my code.
One involves using a tuple and storing everything in the same directory.
Like
Employees
("/employees/IT/","employee1"):{employeeDetails} ("/employees/Marketing/","employee2"):{employeeDetails}
Where '/employees/IT/' and '/employees/Marketing' is a part of a tuple.
Another involves using different directories for different departments.
Like
Employees
--IT
-----employee1:{employeeDetails}
--Marketing
-----employee2:{employeeDetails}
Now each design has its own pros and cons and there are a certain set of requirements for them.
Like
Get an employee
Get all employees
Get employees from one dept
Get all employees from X depts
and some more
I want to benchmark which design is the better one.
So far I have gone through the benchmark on the FDB official website.
I have also checked out benchmarking using YSCB.
I have also recorded the net trace while running the test code and I feel there is too much information for me to comprehend.
Please point me in the direction on how to benchmark certain design patterns in foundation db and I'll be more than grateful.
The text was updated successfully, but these errors were encountered: