We have a business need to build a scheduled task that flags anomalous data in a database provided to us by a third party.
The main requirements given to our team are:
- We need to connect to a database and retrieve data
- We need to identify duplicates in the database and flag them as such
- We need to identify outlier datapoints in the database and flag them as such
- We do not control the data source, and we have read-only access to the database.
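
As a rough starting point, the requirements above might be sketched as follows. Everything specific here is an assumption made for illustration: a SQLite database, a hypothetical `readings` table with `id`, `sensor` and `value` columns, "duplicate" meaning a repeated `(sensor, value)` pair, and the interquartile-range rule as one possible definition of an outlier. Because access is read-only, the flags are returned to the caller rather than written back to the source.

```python
import sqlite3
import statistics


def fetch_rows(conn):
    """Retrieve data with a plain SELECT; nothing is written to the source."""
    # Table and column names are hypothetical placeholders.
    query = "SELECT id, sensor, value FROM readings ORDER BY id"
    return conn.execute(query).fetchall()


def flag_duplicates(rows):
    """Return the ids of rows whose (sensor, value) pair repeats an earlier row."""
    seen = set()
    duplicate_ids = set()
    for row_id, sensor, value in rows:
        key = (sensor, value)
        if key in seen:
            duplicate_ids.add(row_id)
        else:
            seen.add(key)
    return duplicate_ids


def flag_outliers(rows, k=1.5):
    """Return the ids of rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [value for _, _, value in rows]
    if len(values) < 2:
        # statistics.quantiles needs at least two data points.
        return set()
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {row_id for row_id, _, value in rows if value < low or value > high}
```

Keeping retrieval separate from the two flagging functions means each piece can be exercised against a small in-memory database (`sqlite3.connect(":memory:")`) without touching the third-party source.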
You should create a new branch and iterate on the code in this repository to meet the requirements above. Please leave comments in the code to explain your thought process and any changes you make. Push commits to your branch to trigger the GitHub Actions workflow. You can view the workflow results in the Actions tab of the repository.
If you have time, please also consider the following:
- How would you test this code?
- How can the database connection handling be improved?
- How would you handle a large amount of data?
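
One possible direction for the connection-handling and data-volume questions above is sketched below. It assumes a SQLite file purely for illustration (the real source may be a different database entirely), and the helper names `read_only_connection` and `iter_batches` are hypothetical. The connection is opened in read-only mode and always closed, and rows are streamed in fixed-size batches instead of being loaded all at once.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def read_only_connection(path):
    """Open a SQLite file read-only and guarantee that it is closed afterwards."""
    # mode=ro makes SQLite itself reject writes, matching the read-only constraint.
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        yield conn
    finally:
        conn.close()


def iter_batches(conn, query, batch_size=1000):
    """Yield lists of at most batch_size rows so memory use stays bounded."""
    cursor = conn.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch
```

A caller could then write `with read_only_connection(path) as conn:` and loop over `iter_batches(conn, query)`. Note that any check needing global statistics, such as quartiles over the whole table, would still need either a second pass or an incremental estimate when processing in batches.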