Add doc sections for further development #4

6 changes: 6 additions & 0 deletions AmazonWebServices/AWSUtility.md

## Lab guidelines for using the Lab AWS Utility Package

https://github.com/Watts-Lab/AWSUtil


41 changes: 41 additions & 0 deletions AmazonWebServices/Glue_BestPractices.md

[Scaling and Partitioning Data](https://aws.amazon.com/blogs/big-data/best-practices-to-scale-apache-spark-jobs-and-partition-data-with-aws-glue/)
- AWS Glue provides a serverless environment to prepare (extract and transform) and load large datasets from a variety of sources for analytics and data processing with Apache Spark ETL jobs
- Two scaling methods:
  - The first horizontally scales out Apache Spark applications for large, splittable datasets
  - The second vertically scales up memory-intensive Apache Spark applications with the help of the newer AWS Glue worker types
- Understanding AWS Glue worker types
  - Standard, G.1X, and G.2X
  - AWS Glue jobs that need high memory or ample disk space to store intermediate shuffle output can benefit from vertical scaling (using the larger G.1X or G.2X workers); see the CLI sketch after this list
- Horizontal scaling for splittable datasets
  - A file split is a portion of a file that a Spark task can read and process independently on an AWS Glue worker
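As a rough illustration of how the worker-type choice is applied in practice, the AWS CLI call below creates a Glue 3.0 job pinned to ten G.1X workers. This is a sketch only: the job name, role ARN, and script path are placeholders, not values from this repo.

```bash
# Hypothetical example: create a Glue 3.0 job that scales out across ten G.1X workers.
# The job name, IAM role ARN, and S3 script path below are placeholders.
aws glue create-job \
  --name example-etl-job \
  --role arn:aws:iam::123456789012:role/GlueServiceRole \
  --command Name=glueetl,ScriptLocation=s3://example-bucket/scripts/example_etl.py \
  --glue-version 3.0 \
  --worker-type G.1X \
  --number-of-workers 10
```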


[Glue Bookmarks and Glue Optimized Parquet Writer](https://aws.amazon.com/blogs/big-data/load-data-incrementally-and-optimized-parquet-writer-with-aws-glue/)
- Glue Optimized Apache Parquet writer
  - Unlike the default Apache Spark Parquet writer, it does not require a pre-computed schema or a schema inferred by an extra scan of the input dataset --> `glueparquet`
  - The AWS Glue Parquet writer also enables schema evolution by supporting the deletion and addition of columns


[Glue automatic code generation and Workflows](https://aws.amazon.com/blogs/big-data/simplify-data-pipelines-with-aws-glue-automatic-code-generation-and-workflows/)
- ...

[Optimize Memory Management](https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/)
- ...

[Developing Glue ETL jobs locally using a Docker container](https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/)
- AWS Glue is built on top of Apache Spark and therefore uses all the strengths of open-source technologies.
- AWS Glue comes with many improvements on top of Apache Spark and has its own ETL libraries that can fast-track the development process and reduce boilerplate code.

```bash
# pull the AWS Glue libs docker image
docker pull amazon/aws-glue-libs:glue_libs_3.0.0_image_01

# run the image as an interactive PySpark REPL
docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_ACCESS_KEY_ID=... -e AWS_SECRET_ACCESS_KEY=... -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark

# check for interactive dev endpoints
aws glue list-dev-endpoints
```

46 changes: 46 additions & 0 deletions AmazonWebServices/ServiceDescriptions/DeployingWebApplications.md

### Deploying Web Applications

The following are AWS Services useful for deploying web applications.

**Related Concepts:**
- Containerized Applications

### AWS App Runner
- [Summary](https://aws.amazon.com/apprunner/)
- [FAQ](https://aws.amazon.com/apprunner/faqs/)
App Runner is a fully managed container deployment service. It supports the following infrastructure setup steps:
- AutoBuild & Deploy
- Load Balancing
- Autoscaling
- Certificate Setup
- Metric Dashboard

**Use Cases**
- For quickly developing and deploying lightweight applications with expected low demand

**Integration & Configuration Steps**
1) Connect to a source (ECR image upload is recommended)
2) Configure the service
   - Specify vCPUs and memory (resources) for the containers
3) Launch (a CLI sketch follows below)
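
For reference, the same flow can be scripted with the AWS CLI. The sketch below is illustrative only: the service name, ECR image URI, access-role ARN, and port are placeholder assumptions, not values used by the lab.

```bash
# Step 1 (hypothetical): describe the ECR image source for the service.
cat > source-config.json <<'EOF'
{
  "ImageRepository": {
    "ImageIdentifier": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
    "ImageRepositoryType": "ECR",
    "ImageConfiguration": { "Port": "8080" }
  },
  "AuthenticationConfiguration": {
    "AccessRoleArn": "arn:aws:iam::123456789012:role/AppRunnerECRAccessRole"
  },
  "AutoDeploymentsEnabled": true
}
EOF

# Steps 2-3 (hypothetical): set container resources (1 vCPU / 2 GB) and launch the service.
aws apprunner create-service \
  --service-name my-app-service \
  --source-configuration file://source-config.json \
  --instance-configuration Cpu=1024,Memory=2048
```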

**Pricing Model**
- Provisioned Instances: $0.007 / GB hour
- Active Instances: $0.064 / vCPU hour + $0.007 / GB hour
- CloudWatch Logs: $0.50 / GB
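
For rough intuition using the rates listed above: a single 1 vCPU / 2 GB instance that stays active around the clock costs about 730 h x $0.064/vCPU-h + 730 h x 2 GB x $0.007/GB-h, i.e. roughly $57 per month, before log storage.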

**Gotchas**
- VPC connectivity was not supported at launch (see the App Runner VPC support announcement linked below)
- Can't scale down to 0

**Demo Videos/Articles**
- https://www.youtube.com/watch?v=TKirecwhJ2c
- https://www.youtube.com/watch?v=SVfIdT38i9I
- https://www.youtube.com/watch?v=x_1X_4j16A4
- https://www.reddit.com/r/aws/comments/v1hax5/aws_app_runner_deep_dive_my_summary/
- https://aws.amazon.com/blogs/aws/new-for-app-runner-vpc-support/
- https://semaphoreci.com/blog/aws-app-runner


22 changes: 22 additions & 0 deletions AmazonWebServices/StoringVideoData.md

### Storing Video Data Efficiently


**Storing Video Data Itself**


**Storing Links to Video Data**


An example pre-signed S3 URL (returned by eyeson), broken onto separate lines to show its components:

- `https://s3.eyeson.com`
- `/meetings/62dadc5861b4b000107a87d3`
- `/62dadc5d2e5c55000f18fa1e?X-Amz-Algorithm=AWS4-HMAC-SHA256`
- `&X-Amz-Credential=AH07FBAZO9IYS38J6742%2F20220722%2Fus-east-1%2Fs3%2Faws4_request`
- `&X-Amz-Date=20220722T181016Z`
- `&X-Amz-Expires=3600`
- `&X-Amz-SignedHeaders=host`
- `&X-Amz-Signature=ed50a7593edb7da8aabf548be4b5b8551560dd23cbd57616351f325d1c347ba2`
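
If we store only the bucket and key ourselves, an expiring link like the one above can be regenerated on demand. A minimal sketch using the AWS CLI, with a made-up bucket and key:

```bash
# Hypothetical bucket/key: generate a pre-signed GET URL that expires in one hour
aws s3 presign s3://example-video-bucket/path/to/recording.mp4 --expires-in 3600
```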

Binary file removed Git_GitHub_Guide/Git GitHub Guide_20211216.pdf
72 changes: 72 additions & 0 deletions HighPerformanceComputeCluster/UsePatterns.md

## High Performance Computing Cluster (HPCC)

From [Wharton's Documentation](https://research-it.wharton.upenn.edu/documentation/):
> The Wharton School HPC Cluster is a 32-node, 512-core Linux cluster environment designed to support the school’s academic research mission. It is managed collaboratively by Wharton Computing’s Research and Innovation and Core Services teams.

## Use Patterns via the command line

**Standard Job / Job Arrays (qsub) and Head Node Interactive Development (qlogin)**
- Submitting job arrays and developing interactively on the HPCC head node do not incur any marginal cost
- For job configurations requiring GPUs, we can submit jobs at no marginal cost in the free tier, but the jobs will automatically shut down after 4 hours
- For each job submission, we can set the:
  - GPU count (`-l gpu`),
  - available memory (`-l m_mem_free`),
  - task count (`-t`), which sets the number of tasks for a job submitted as a job array (see the example submission below)
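A minimal sketch of what such a submission might look like on the command line. The exact resource names and values (in particular whether the GPU resource is requested as `gpu=1`) should be confirmed against Wharton's documentation; the flag values below are placeholders.

```bash
# Hypothetical job-array submission:
#   -l m_mem_free=8G   request 8 GB of memory per task
#   -l gpu=1           request one GPU (free tier: job stops after 4 hours)
#   -t 1-100           run demo.sh as a 100-task job array
qsub -l m_mem_free=8G -l gpu=1 -t 1-100 demo.sh

# Inside demo.sh, each array task can read its index from $SGE_TASK_ID.
```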

**Cloud Bursting Job / Job Array Submission (qsub-aws) & Interactive Development (qlogin-aws)**
- https://research-it.wharton.upenn.edu/documentation/cloud-bursting/
- We pay for hours of dedicated instance usage, at a 40% discount
- The AWS instances only run while jobs are submitted to and running in the dedicated queue, and they shut down as jobs complete. We pay only for the hours they are running jobs, which excludes start-up, shutdown, and any idle time; instances are automatically shut down after one idle billable hour to conserve budget
- Billing is sent out the following month, for the previous month of usage


**Monitoring Active Use**
- You will be able to see the current usage costs with the `unicloud-dbr` command on the hpcc-login node.


## General Use

These are general guidelines I use for day-to-day work on the HPCC.

**Logging Into HPCC Outside Penn Network: VPN**
- [guide from Wharton](https://support.wharton.upenn.edu/help/wharton-vpn#connecting-to-the-vpn)

**Using the Head Node via `qlogin`. This is for interactive development, not for large jobs**
- `ssh [email protected]`
- `qlogin -now no`
- `module load python/python-3.9.6` --> set the default Python version
- `module load gcc/gcc-11.1.0`
- `python -m venv myvenv` --> create a new virtual environment
- `source myvenv/bin/activate` --> activate it (given `myvenv` is your venv)
- `pip install -U pip`
- `pip install -U setuptools wheel`
- create a `requirements.txt` file for the virtual environment
- `pip install -r requirements.txt`

**Authenticating AWS Access:**
- Example command:
- `alias aws-login-pennmap="aws-federated-auth --account 088838630371 --user mriv;export AWS_PROFILE=aws-seas-wattslab-acct-PennAccountAdministrator"`

**Running Jupyter when connected via VPN**
- Ask Wharton Support to set up port forwarding on your HPCC environment, then follow these steps:
  - follow the general `qlogin` steps above
  - `jupyter lab &`
  - open another terminal window on your local computer and copy/paste the SSH tunnel command from the `jupyter lab` output (below is just an example, it won't work for you):
    - `ssh [email protected] -f -N -L 47686:hpcc019:47686`
  - then open a local browser and copy/paste the 127.0.0.1 URL from the `jupyter lab` output, similar (but not the same!) as the example below:
    - `http://127.0.0.1:47686/lab?token=41a0db54573edaf50e661b3a88894dbe28c4db9003f`


**Submitting Jobs:**
- [Wharton Best Practices](https://research-it.wharton.upenn.edu/documentation/programming-best-practices/)
- write module ("demo.py") that will execute upon being invoked via `python demo.py`
- write submission script, `demo.sh` that simply contains the `python demo.py` command
- submit to the cluster with `qsub demo.sh`
- output will be in the same directory as demo.sh
- check on submitted job status:
- qstat: displays the status of the HPCC queues, by default displaying your job information. It includes running and queued jobs
- [See Wharton Documentation](https://research-it.wharton.upenn.edu/documentation/job-management/)
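
A minimal sketch of the `demo.sh` pattern described above, assuming the virtual environment set up in the `qlogin` section; the job name, module version, and venv path are illustrative.

```bash
#!/bin/bash
# demo.sh -- minimal submission script sketch (module version and venv path are illustrative)
#$ -N demo_job        # job name shown in qstat
#$ -j y               # merge stdout and stderr into one output file

module load python/python-3.9.6
source "$HOME/myvenv/bin/activate"
python demo.py
```

Submit it with `qsub demo.sh` (or `qsub -t 1-10 demo.sh` to run it as a 10-task job array), then check progress with `qstat`.
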
4 changes: 4 additions & 0 deletions MechanicalTurk/LabAccountOrg.md

## CSSLab MTurk Account Organization


3 changes: 3 additions & 0 deletions MechanicalTurk/Panel.md

## The Panel

21 changes: 21 additions & 0 deletions MechanicalTurk/WorkerPaymentProcess.md

### Current Payment Notification Process: Deliberation-Empirica

- The following details the current payment process for Mechanical Turk (MTurk) workers drawn from the Panel who complete tasks for the Deliberation-Empirica project
- Deliberation-Empirica repository [here](https://github.com/Watts-Lab/deliberation-empirica)
- At a high level, the payment process is currently executed manually via a script found in the Turk-Interface [repository](https://github.com/Watts-Lab/deliberation-project/blob/main/survey_workflows/sending-notifications.R)
- The process has three steps and is designed to include intermediate checkpoints that let lab members review or intervene, as opposed to a fully automated payment system with no opportunity for human input


- Receive a file with payment specifications
  - This is simply a list of worker IDs and the associated bonus amounts to be distributed
- Update `sending-notifications.R` and run each of the following sections independently, as three steps:
  - Section 1 —> Organizing Payments Data, line 714:
    - primary function: create the `unpaid_people` dataframe, a table of worker IDs and associated payment amounts
  - Section 2 —> Registering Payments Data for Internal Review, line 744:
    - primary function: create and send a notification with the payment summary to the monitoring-panel Slack channel
    - this allows the team to review the payment summary before payments are actually distributed, in case something needs to be addressed
  - Section 3 —> Execute Payments, line 769:
    - primary function: execute payment of the workers listed in the `unpaid_people` table
    - involves authentication with the MTurk API and an MTurk payment token (a rough sketch of the underlying call follows below)
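
For context, the underlying MTurk operation in Section 3 is a per-worker bonus payment. A rough single-worker equivalent using the AWS CLI is sketched below; all IDs and amounts are placeholders, and the production process goes through `sending-notifications.R`, not this command. Note that the MTurk SendBonus call also requires the assignment ID of the worker's HIT, which is not part of the payment-specification file described above.

```bash
# Hypothetical single-worker bonus payment (placeholder IDs; MTurk is only available in us-east-1).
aws mturk send-bonus \
  --region us-east-1 \
  --worker-id A1EXAMPLEWORKERID \
  --assignment-id 3EXAMPLEASSIGNMENTID \
  --bonus-amount "5.00" \
  --reason "Bonus for Deliberation-Empirica session participation" \
  --unique-request-token payment-batch-001-A1EXAMPLEWORKERID
```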
