-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.md.backup
117 lines (79 loc) · 3.96 KB
/
README.md.backup
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
# dhsc_data_tools package
**In active development**
The goal of DHSCdatatools is to provide a suite of tools for using data hosted on the DHSC analytical cloud (DAC) platform.
[For detailed developer documentation click here.](https://github.com/DataS-DHSC/dhsc-data-tools/tree/main/docs/dhsc_data_tools)
## Pre-requisites
A new conda environment is recommended for the package. In Git Bash:
```
conda create -n <your_environment_name> pip python-dotenv
```
## To install the dhsc_data_tools package
In Git Bash, with your relevant environment activated:
```
pip install git+https://github.com/DataS-DHSC/dhsc-data-tools.git
```
## .env and config files
A .env file containing tenant name and key vault name is required for `dhsc_data_tools.dac_odbc.connect()` and `dhsc_data_tools.keyvault.kvConnection()`.
Please find the .env file in the [Data Science Teams space DAC channel](https://teams.microsoft.com/l/channel/19%3ad94b5e4692d043249285162a04b35d12%40thread.tacv2/DAC%2520(DHSC%2520analytical%2520cloud)?groupId=88d91456-9588-4bed-a713-fde91b11a227&tenantId=61278c30-91a8-4c31-8c1f-ef4de8973a1c).
`dhsc_data_tools.remote_compute.connect_cluster()` requires and .config file built in the following manner:
[CONFIG-PROFILE-NAME]
TOKEN=XXX # this is your Databricks access token
HOST=XXX # this is the host id (not the host name)
CLUSTER_ID=XXX # this is the id of the Databricks compute cluster
Both these files should be in your working directory.
**IMPORTANT**
**Ensure in each project your `.gitignore` file excludes `.env` and `.config` files.**
If you do accidentally commit these files (or any other sensitive data) please get in touch with the [Data Science Hub](mailto:[email protected]) to discuss how best to mitigate the breach.
## Example use scripts
### Connecting to the DAC data using an SQL endpoint
```python
from dhsc_data_tools import dac_odbc
from dotenv import load_dotenv
load_dotenv(".env")
#create client
conn = dac_odbc.connect()
# Run a SQL query by using the preceding connection.
cursor = conn.cursor()
cursor.execute("SELECT * FROM samples.nyctaxi.trips LIMIT 10")
# Print the rows retrieved from the query.
for row in cursor.fetchall():
print(row)
```
### Retrieve keyvault secrets
```python
from dhsc_data_tools import keyvault
from pypac import pac_context_for_url
from dotenv import load_dotenv
load_dotenv(".env")
#create keyvault client
kvc = keyvault.kvConnection("DEV") #kvConnection() defaults to PROD
#get secret
with pac_context_for_url(kvc.KVUri):
my_key = kvc.get_secret('dummy-example-key')
print(my_key)
```
### Connect to a Databricks remote compute cluster
```python
from dhsc_data_tools import remote_compute
from dotenv import load_dotenv
load_dotenv(".env")
spark = remote_compute.connect_cluster()
example_data = [('James','','Smith','1991-04-01','M',3000),
('Michael','Rose','','2000-05-19','M',4000),
('Robert','','Williams','1978-09-05','M',4000),
('Maria','Anne','Jones','1967-12-01','F',4000),
('Jen','Mary','Brown','1980-02-17','F',-1)
]
example_columns = ["firstname","middlename","lastname","dob","gender","salary"]
example_df = spark.createDataFrame(data=example_data,
schema = example_columns)
type(example_df)
example_df.show()
```
## Code of Conduct
Please note that the DHSCdatatools project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html).
By contributing to this project, you agree to abide by its terms.
## Licence
Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.
All other content is [© Crown copyright](http://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/)
and available under the terms of the [Open Government 3.0 licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/), except where otherwise stated.