Weather Data Plots #168

Open
wants to merge 67 commits into
base: master
67 commits
1134def
Added checkbox weather and fixed layout
Apr 16, 2020
fd39e45
Added script to retrieve data from weather API
an6eel Apr 17, 2020
785b046
Added functions to retrieve weather data for each state
an6eel Apr 20, 2020
fc358b4
Added config file
an6eel Apr 20, 2020
89ea987
added cron task to get weather data
an6eel Apr 20, 2020
91f1389
Added route to get weather data
an6eel Apr 20, 2020
2987373
Added weather DAG
an6eel Apr 22, 2020
ed4b19c
Added functions to download weather data locally or from the cloud
an6eel Apr 23, 2020
5962bf5
Added config to weather route
an6eel Apr 23, 2020
f8ee0e1
Added documentation about weather data
an6eel Apr 23, 2020
6593686
Fixed get weather task
an6eel Apr 24, 2020
5913826
Fixed get weather functionality
an6eel Apr 24, 2020
8df1b63
Fixed weather route and requirements
an6eel Apr 24, 2020
44d0de3
logic ready to work with checkbox
Apr 22, 2020
3927b28
improved if efficiency
Apr 22, 2020
7619b82
improved mobile view
Apr 22, 2020
4e07afe
draw weather with test data
Apr 22, 2020
e84d1b0
added highchart more library
Apr 22, 2020
99cd54f
major changes
Apr 22, 2020
fbefd2d
formatted weather highchart plot
Apr 22, 2020
ee2d54a
if checkbox desactivated weather removed
Apr 22, 2020
4e5661b
increased top space between weather plots
Apr 22, 2020
ac882b0
changed label text
Apr 22, 2020
15305a9
improved boostrap position options buttons
Apr 22, 2020
f93da2f
remove unnecessary data
Apr 22, 2020
741ae08
improved showing one or several charts for counties
Apr 23, 2020
0037ebc
refractor code, organise functions
Apr 23, 2020
d7c0360
fixed exception id not found
Apr 23, 2020
d9e0f75
Added visists data
Apr 23, 2020
5e4b991
improved doc
Apr 24, 2020
dbb6f2b
weather colors are now lighten
Apr 24, 2020
2739a31
Merge branch 'feature/weather-scripts' into cleanbranch
Apr 27, 2020
3448675
Integration with backend and colors plot improved
Apr 27, 2020
09c5d50
weather data preloaded
Apr 27, 2020
d4a8280
.gitignore added pycache
Apr 27, 2020
87d67e0
improved charts
Apr 27, 2020
d9c3764
added toggle between visits and chart plot
Apr 27, 2020
3eb2a2b
Fixed airflow task
an6eel Apr 27, 2020
2ba4e56
Fixed configuration to deploy
an6eel Apr 27, 2020
2075730
Merge branches 'feature/weather-info' and 'feature/weather-scripts' o…
an6eel Apr 27, 2020
cf46a04
Fixed configuration to deploy
an6eel Apr 27, 2020
90b3274
fix to cross data
Apr 27, 2020
df613c3
fixed buttons condition
Apr 27, 2020
0914b9f
plot tag min/max fixed
Apr 27, 2020
3c3e2a3
Merge branch 'feature/weather-info' of github.com:strivelabs/ca_visit…
Apr 27, 2020
4cd73f1
Improved gui and buttons in screen
Apr 28, 2020
73f21ec
added weather instructions to notes
Apr 28, 2020
df60024
Fixed fetch route
an6eel Apr 28, 2020
8931f31
Merge branch 'feature/weather-info' of https://github.com/celtiberian…
an6eel Apr 28, 2020
de92759
merge weather upstream master
Apr 29, 2020
d627953
visits % to #. Improved code corrections
Apr 30, 2020
3c80ae3
merge weather upstream master 30/04/20
Apr 30, 2020
650709b
weather data var created
Apr 30, 2020
22ebba7
weather data for non selected venue type
Apr 30, 2020
0ee5f84
rounded y visits value
Apr 30, 2020
90a6e38
Merge pull request #1 from strivelabs/feature/300420-weather-info
an6eel Apr 30, 2020
30049e9
update 300420
Apr 30, 2020
5b9b5e7
update requirements
an6eel May 4, 2020
db6f0af
update 040520
May 4, 2020
7452675
update 050520
May 5, 2020
fd01848
app.yaml update 050520
May 5, 2020
4caf68a
weather.js: precipitations typo fixed
May 5, 2020
9577f21
weahter.js: weather remove min temperatures
May 5, 2020
5f6c636
Added Venue Type to title
May 5, 2020
24361cb
Merge pull request #3 from strivelabs/features/weather-anne-050520
an6eel May 5, 2020
a1cdc80
20200504
ajanian May 5, 2020
85c0546
update 070520
May 7, 2020
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -8,3 +8,5 @@ localdata/
*.iml
datascratch/
.gcpprj

scripts/__pycache__
30 changes: 30 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,36 @@ Historic data snapshots are also hosted on https://data.visitdata.org/
For example, you can retrieve
https://data.visitdata.org/processed/vendor/foursquare/asof/20200403-v0/taxonomy.json

# Weather Data

Weather data is retrieved from weatherapi.com. After retrieving the data, it creates a file with the following structure:
`<state>.json:`
```
{
    <county-1>: {
        forecast: {
            day-1-timestamp: { <weather_data> },
            day-2-timestamp: { <weather_data> },
            ...
        }
    },
    <county-2>: { .... }
    ...
}
```
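
As an illustration of the structure described above, the following minimal sketch extracts one field as a per-day series for a single county. The sample county name, the field names `maxtemp_c` and `totalprecip_mm`, and the helper `county_series` are assumptions made for this example; the real files may carry different fields, and the ETL job in this PR also writes a top-level `updated` key that is not a county.

```python
import json
from datetime import datetime, timezone

# Hypothetical sample in the <state>.json layout; keys under "forecast"
# are epoch timestamps, which become strings once serialized as JSON.
sample = json.loads("""
{
  "Alameda County": {
    "forecast": {
      "1587945600": {"maxtemp_c": 22.1, "totalprecip_mm": 0.0},
      "1587859200": {"maxtemp_c": 19.4, "totalprecip_mm": 1.2}
    }
  }
}
""")


def county_series(state_data, county, field):
    """Return [(iso_date, value)] for one county, sorted by day."""
    forecast = state_data.get(county, {}).get("forecast", {})
    rows = []
    # Sort numerically on the epoch so days come out in chronological order.
    for epoch, day in sorted(forecast.items(), key=lambda kv: int(kv[0])):
        iso_date = datetime.fromtimestamp(int(epoch), tz=timezone.utc).date().isoformat()
        rows.append((iso_date, day.get(field)))
    return rows


print(county_series(sample, "Alameda County", "maxtemp_c"))
```

A county that is missing from the file yields an empty list rather than an error, which keeps plotting code simple.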

In the **production environment**, a task will be executed daily (`etl/dags/extract_weather_data.py`) to store
the weather data in a Google Cloud bucket.

`BUCKET_NAME` variable must be defined in the airflow server

Once the data is stored in the bucket, we can get the weather data for one state in the route
`weather/<state>`
Collaborator

We might want to consolidate all of this under https://visitdata.org/data/ and present it as a single bundle that all goes together.

Contributor

@markroth8 the only issue is that it would increase the size of the initial data load for each page before it renders. If we assume that in most cases, people will be looking at the visits and not the weather, it would be better to load it only upon request.
In general if we plan to add more data sources which might be only displayed based on user's choices in the UI, the architecture of separating those data sources might be better?


To get that data from the google cloud the `BUCKET_NAME` environment variable must be defined. Otherwise,
data will be stored locally in the path `localdata/`


# Importing new data
To import new data:

Expand Down
3 changes: 2 additions & 1 deletion app.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,5 @@ runtime: python37
entrypoint: gunicorn -b :$PORT main:app

env_variables:
FOURSQUARE_DATA_VERSION: "20200503-v0"
FOURSQUARE_DATA_VERSION: "20200504-v0"
BUCKET_NAME: "vd-weather-data"
Collaborator

Let's add a WEATHERAPI_DATA_VERSION variable to set the as of date for the weather data as part of the deployment. We can use the same bucket (data.visitdata.org).

94 changes: 94 additions & 0 deletions etl/dags/extract_weather_data.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
from datetime import timedelta, datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
import json
from google.cloud.storage import Client
import requests
from deepmerge import always_merger
import os

BASE_URL = "http://api.weatherapi.com/v1/history.json?key={}&q={}+united+states&dt={}"
# Get a weatherapi.com api key
API_KEY = os.environ.get("API_WEATHER_KEY", "a70a4e2736644cdcb9d85348202404")
Collaborator

Keys must never be committed to version control. This must be removed and the history squashed before merging. We can register the API key as an airflow secret.

BUCKET_NAME = os.environ.get("BUCKET_NAME", "default")
Collaborator

Should fail if env not set, to force correct configuration, instead of silently choosing an invalid bucket name.
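
One way to get the fail-fast behaviour requested here (and to avoid a committed fallback for the API key) is a small helper that raises when a variable is missing. This is only a sketch: `require_env` is a hypothetical name, not something in this PR, and the `environ` parameter exists purely so the example can run without touching the real environment.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = environ.get(name, "")
    if not value:
        raise RuntimeError(f"Required environment variable {name!r} is not set")
    return value


# Demonstration with a stand-in mapping instead of the real environment.
fake_env = {"BUCKET_NAME": "example-bucket"}
print(require_env("BUCKET_NAME", fake_env))

try:
    require_env("API_WEATHER_KEY", fake_env)
except RuntimeError as exc:
    print(exc)
```

In the DAG module this would replace the `os.environ.get(..., default)` calls, so a misconfigured deployment fails at import time instead of writing to an unintended bucket.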


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

gcs = Client()
bucket = gcs.bucket(BUCKET_NAME)
state_file = bucket.get_blob("states_counties.json")
STATES = json.loads(state_file.download_as_string())

def slugify_state(state):
    return "-".join(state.split())

def get_weather_data(query):
    weather = {}
    weather["forecast"] = {}
    date = datetime.today()
    full_url = BASE_URL.format(API_KEY, query, date.strftime('%Y-%m-%d'))
    response = requests.get(full_url)
    data = response.json()
    try:
        forecast = data["forecast"]["forecastday"][0]
        location = data["location"]
        forecast["day"].pop("condition")
        weather = {**weather, **location}

        weather["forecast"][forecast["date_epoch"]] = forecast["day"]
    except:
        return weather

    return weather

def weather_func_builder(state):
    selected_state = state
    def get_weather():
        data = {"updated": datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
        counties = STATES[selected_state]
        blob = bucket.get_blob("{}.json".format(selected_state))
        if blob is None:
            stated_cached_data = {}
        else:
            stated_cached_data = json.loads(blob.download_as_string())
        for county in counties:
            api_data = get_weather_data(county)
            cached_data = stated_cached_data.get(county, {})
            data[county] = always_merger.merge(cached_data, api_data)

        state_blob = bucket.blob("{}.json".format(selected_state))
Collaborator

Proposed: Let's use gs://data.visitdata.org/processed/vendor/api.weatherapi.com/asof/yyyymmdd/*.json to store this data.
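
The proposed layout can be sketched as a small helper that builds the object name for a given state and as-of date. The function name `weather_blob_name` and the `PREFIX` constant are assumptions for illustration; the bucket itself (`data.visitdata.org`) would be selected separately when creating the bucket handle.

```python
from datetime import date

# Object-name prefix taken from the proposal above; the bucket is chosen elsewhere.
PREFIX = "processed/vendor/api.weatherapi.com/asof"


def weather_blob_name(state, asof):
    """Build the object name <prefix>/<yyyymmdd>/<state>.json for one state."""
    return "{}/{}/{}.json".format(PREFIX, asof.strftime("%Y%m%d"), state)


print(weather_blob_name("California", date(2020, 5, 4)))
# processed/vendor/api.weatherapi.com/asof/20200504/California.json
```

Keeping the date in the path means each daily run writes a new snapshot instead of overwriting `<state>.json`, which mirrors how the Foursquare snapshots are addressed by version.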

        state_blob.upload_from_string(json.dumps(data))

        return True
    return get_weather



def create_dag(dag_id, state):
    dag = DAG(
        dag_id=dag_id,
        description="Weather DAG",
        default_args=default_args,
        schedule_interval='@daily'
    )


    get_data_api = PythonOperator(
        task_id="get-data-{}".format(slugify_state(state)),
        python_callable=weather_func_builder(state),
        dag=dag
    )

    return dag

for state in STATES.keys():
    dag_id = "{}-weather".format(slugify_state(state))
    globals()[dag_id] = create_dag(dag_id, state)

2 changes: 2 additions & 0 deletions etl/requirements.txt
Original file line number Diff line number Diff line change
@@ -1,2 +1,4 @@
apache-airflow==1.10.10
google-cloud-storage==1.27.0
requests
deepmerge
Collaborator

Pin version in requirements.txt so we know we're running the same thing.

19 changes: 19 additions & 0 deletions main.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,13 @@
import yaml
from flask import Flask, redirect, render_template, request
from google.cloud import storage
from scripts.weather import get_state_weather_locally, get_state_weather_cloud


app = Flask(__name__, static_url_path="", static_folder="static")
app_state = {
    "maps_api_key": "",
    "weather_path_data": "vd-weather-data",
    "foursquare_data_url": "",
    "foursquare_data_version": ""
}
Expand Down Expand Up @@ -143,6 +145,13 @@ def data(path):
snapshot_id=app_state['foursquare_data_version'])


@app.route("/weather/<state>")
Collaborator

Maybe we can just get this from /data/weather/<state>

def weather(state):
    if app_state["weather_path_data"] != "":
        return get_state_weather_cloud(state, app_state["weather_path_data"])
    else:
        return get_state_weather_locally(state)

def page_not_found(e):
    return render_template('404.html'), 404

Expand Down Expand Up @@ -183,11 +192,21 @@ def _init_data_env():
    app_state["foursquare_data_url"] =\
        f"//data.visitdata.org/processed/vendor/foursquare/asof/{foursquare_data_version}"

def _init_weather_data_env():
    # Gcloud bucket name
    bucket_name = os.getenv("BUCKET_NAME", "vd-weather-data")

    if bucket_name == "":
        error("Weather data will be stored locally")

    app_state["weather_path_data"] = bucket_name


def _init():
    app.config["SEND_FILE_MAX_AGE_DEFAULT"] = 60
    app.register_error_handler(404, page_not_found)
    _init_maps_api_key()
    _init_weather_data_env()
    _init_data_env()
    print(app_state)

Expand Down
4 changes: 3 additions & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,6 @@
Flask==1.1.1
gunicorn==19.10.0
gunicorn==20.0.
google-cloud-storage==1.27.0
pyyaml==5.3.1
requests
deepmerge
Collaborator

Pin version in requirements.txt so we know we're all running the same thing.

7 changes: 7 additions & 0 deletions scripts/config.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
BASE_URL = "http://api.weatherapi.com/v1/history.json?key={}&q={}+united+states&dt={}"
API_KEY = "a70a4e2736644cdcb9d85348202404"
Collaborator

API keys must never be committed to version control. This must be removed and squashed before merging.

DATA_PATH = "localdata/"

NO_STATE_ERROR_RESPONSE = {
    "error": "There is no data for that state"
}
62 changes: 62 additions & 0 deletions scripts/weather.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
import requests, json
from datetime import datetime, timedelta
from deepmerge import always_merger
from scripts.config import *
from google.cloud import storage


def get_weather_data(query, limit=30):
Collaborator

At runtime we shouldn't reach out to the weather API (and especially not make several calls - we will probably have our API key revoked if our traffic spikes). The ETL job already downloads this offline and puts it in a bucket, so we should access from there (and then probably cache in memory similar to main._list_names()).
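
A rough sketch of the in-memory caching idea follows. It does not reproduce how `main._list_names()` caches (that code is not shown here); `StateWeatherCache`, its `fetch` callable, and the one-hour default TTL are all assumptions. In the app, `fetch` would wrap the bucket read (for example `bucket.get_blob(...).download_as_string()`), while the demo below uses a fake so it runs without cloud credentials.

```python
import json
import time


class StateWeatherCache:
    """Cache parsed per-state weather JSON in memory for a limited time."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch      # callable: state -> JSON text, or None if absent
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # state -> (stored_at, parsed_data)

    def get(self, state):
        now = self._clock()
        entry = self._entries.get(state)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh: no bucket read
        raw = self._fetch(state)
        data = json.loads(raw) if raw is not None else {}
        self._entries[state] = (now, data)
        return data


# Demo with a fake fetcher that records how often it is called.
calls = []

def fake_fetch(state):
    calls.append(state)
    return '{"Alameda County": {"forecast": {}}}'

cache = StateWeatherCache(fake_fetch, ttl_seconds=60)
cache.get("California")
cache.get("California")
print(len(calls))  # the second lookup is served from memory
```

With this shape the request handler never calls weatherapi.com; it only reads what the ETL job already stored, and repeated requests for the same state within the TTL do not hit the bucket either.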

    weather = {}
    weather["forecast"] = {}
    for day in range(limit, -1, -1):
        date = datetime.today() - timedelta(days=day)
        full_url = BASE_URL.format(API_KEY, query, date.strftime('%Y-%m-%d'))
        response = requests.get(full_url)
        data = response.json()
        try:
            forecast = data["forecast"]["forecastday"][0]
            location = data["location"]
            forecast["day"].pop("condition")
            weather = {**weather, **location}

            weather["forecast"][forecast["date_epoch"]] = forecast["day"]
        except :
            continue

    return weather

def __load_state_file(state):
Collaborator

dunders are typically reserved for the Python Core API team. A single _ prefix would suffice here. See https://amontalenti.com/2013/04/11/python-double-under-double-wonder
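
A practical reason to prefer the single underscore: names that start with two underscores (and do not end with two) are subject to name mangling whenever they are referenced inside a class body. The sketch below shows the effect; the class `WeatherService` and both helper functions are hypothetical and exist only to demonstrate the behaviour.

```python
def __load_state(state):
    """Module-level helper with a double-underscore prefix."""
    return {"state": state}


def _load_state(state):
    """Same helper with the conventional single-underscore prefix."""
    return {"state": state}


class WeatherService:
    def load_mangled(self, state):
        # Inside a class, this name is compiled as _WeatherService__load_state,
        # which does not exist at module level, so the call raises NameError.
        return __load_state(state)

    def load_plain(self, state):
        # Single-underscore names are left untouched.
        return _load_state(state)


service = WeatherService()
print(service.load_plain("California"))

try:
    service.load_mangled("California")
except NameError as exc:
    print("NameError:", exc)
```

The functions in `scripts/weather.py` are only called from module-level functions today, so they work as written, but the single-underscore form avoids this trap if the code is ever moved into a class.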

    try:
        with open("{}{}.json".format(DATA_PATH, state)) as f:
            data = json.load(f);
            return data
    except FileNotFoundError:
        return {}

def __update_state_file(state, data):
Collaborator

Change dunder to single underscore.

    with open("{}{}.json".format(DATA_PATH, state), "w+") as f:
        json.dump(data, f)

def get_state_weather_locally(state):
Collaborator

This was good for debug mode, but once the ETL works, we can probably remove the local access in favor of always reading from the bucket, similar to the visit data.

    STATES = {}

    with open("states_counties.json") as f:
        STATES = json.load(f)

    if state not in STATES:
        return json.dumps(NO_STATE_ERROR_RESPONSE)
    cached_data = __load_state_file(state)
    for county in STATES[state]:
        weather_data = get_weather_data(county)
        county_data = cached_data.get(county, {})
        cached_data[county] = always_merger.merge(county_data, weather_data)
    __update_state_file(state, cached_data)
    return json.dumps(cached_data)

def get_state_weather_cloud(state, bucket_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    file = bucket.get_blob("{}.json".format(state))
    return json.loads(file.download_as_string()) if file is not None else {}


24 changes: 5 additions & 19 deletions static/css/visitdata.css
Original file line number Diff line number Diff line change
Expand Up @@ -59,10 +59,6 @@ h4 {
margin-top: 20px;
}

.mobile-dropdown {
margin-left: 60px;
}

}

@media (min-width: 990px) and (max-width: 1199px) {
Expand All @@ -83,10 +79,6 @@ h4 {
height: 566px;
}

.mobile-btn {
margin-left: 25px;
}

.mobile-header {
margin-top: 15px;
}
Expand All @@ -98,10 +90,6 @@ h4 {
.mobile-search {
margin-top: 20px;
}

.mobile-dropdown {
margin-left: 60px;
}
}

@media (max-width: 989px) {
Expand All @@ -118,10 +106,6 @@ h4 {
max-width: 240px;
}

.mobile-btn {
margin-left: 10px;
}

.mobile-header {
margin-top: 15px;
}
Expand All @@ -130,9 +114,6 @@ h4 {
margin-top: 20px;
}

.mobile-dropdown {
margin-left: 60px;
}
}

@media (min-width: 745px) and (max-width: 815px) {
Expand Down Expand Up @@ -193,3 +174,8 @@ h4 {
margin-left: 0px;
}
}

#weather-label {
text-overflow: ellipsis;
overflow: hidden;
}