Add doc sections for further development #4

6 changes: 6 additions & 0 deletions AmazonWebServices/AWSUtility.md

## Lab guidelines for using the Lab AWS Utility Package

https://github.com/Watts-Lab/AWSUtil


41 changes: 41 additions & 0 deletions AmazonWebServices/Glue_BestPractices.md

[Scaling and Partitioning Data](https://aws.amazon.com/blogs/big-data/best-practices-to-scale-apache-spark-jobs-and-partition-data-with-aws-glue/)
- AWS Glue provides a serverless environment to prepare (extract and transform) and load large datasets from a variety of sources for analytics and data processing with Apache Spark ETL jobs
- Two scaling methods:
  - The first horizontally scales out Apache Spark applications for large, splittable datasets
  - The second vertically scales up memory-intensive Apache Spark applications with the help of the newer AWS Glue worker types
- Understanding AWS Glue worker types
  - Standard, G.1X, and G.2X
  - AWS Glue jobs that need high memory or ample disk space to store intermediate shuffle output can benefit from vertical scaling (using the larger G.1X or G.2X workers); see the CLI sketch after this list
- Horizontal scaling for splittable datasets
  - A file split is a portion of a file that a Spark task can read and process independently on an AWS Glue worker
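As a rough illustration of how the worker-type choice is applied in practice, the AWS CLI call below creates a Glue 3.0 job pinned to ten G.1X workers. This is a sketch only: the job name, role ARN, and script path are placeholders, not values from this repo.

```bash
# Hypothetical example: create a Glue 3.0 job that scales out across ten G.1X workers.
# The job name, IAM role ARN, and S3 script path below are placeholders.
aws glue create-job \
  --name example-etl-job \
  --role arn:aws:iam::123456789012:role/GlueServiceRole \
  --command Name=glueetl,ScriptLocation=s3://example-bucket/scripts/example_etl.py \
  --glue-version 3.0 \
  --worker-type G.1X \
  --number-of-workers 10
```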


[Glue Bookmarks and Glue Optimized Parquet Writer](https://aws.amazon.com/blogs/big-data/load-data-incrementally-and-optimized-parquet-writer-with-aws-glue/)
- Glue Optimized Apache Parquet writer
  - Unlike the default Apache Spark Parquet writer, it does not require a pre-computed schema or a schema inferred by an extra scan of the input dataset --> `glueparquet`
  - The AWS Glue Parquet writer also enables schema evolution by supporting the deletion and addition of columns


[Glue automatic code generation and Workflows](https://aws.amazon.com/blogs/big-data/simplify-data-pipelines-with-aws-glue-automatic-code-generation-and-workflows/)
- ...

[Optimize Memory Management](https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/)
- ...

[Developing Glue ETL jobs locally using a Docker container](https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/)
- AWS Glue is built on top of Apache Spark and therefore uses all the strengths of open-source technologies.
- AWS Glue comes with many improvements on top of Apache Spark and has its own ETL libraries that can fast-track the development process and reduce boilerplate code.

```bash
# pull the AWS Glue libs docker image
docker pull amazon/aws-glue-libs:glue_libs_3.0.0_image_01

# run the image as an interactive PySpark REPL
docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_ACCESS_KEY_ID=... -e AWS_SECRET_ACCESS_KEY=... -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark

# check for interactive dev endpoints
aws glue list-dev-endpoints
```

46 changes: 46 additions & 0 deletions AmazonWebServices/ServiceDescriptions/DeployingWebApplications.md

### Deploying Web Applications

The following are AWS Services useful for deploying web applications.

**Related Concepts:**
- Containerized Applications

### AWS App Runner
- [Summary](https://aws.amazon.com/apprunner/)
- [FAQ](https://aws.amazon.com/apprunner/faqs/)
App Runner is a fully managed container deployment service. It supports the following infrastructure setup steps:
- AutoBuild & Deploy
- Load Balancing
- Autoscaling
- Certificate Setup
- Metric Dashboard

**Use Cases**
- For quickly developing and deploying lightweight applications with expected low demand

**Integration & Configuration Steps**
1) Connect to a source (ECR image upload is recommended)
2) Configure the service
   - Specify vCPUs and memory (resources) for the containers
3) Launch (a CLI sketch follows below)
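
For reference, the same flow can be scripted with the AWS CLI. The sketch below is illustrative only: the service name, ECR image URI, access-role ARN, and port are placeholder assumptions, not values used by the lab.

```bash
# Step 1 (hypothetical): describe the ECR image source for the service.
cat > source-config.json <<'EOF'
{
  "ImageRepository": {
    "ImageIdentifier": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
    "ImageRepositoryType": "ECR",
    "ImageConfiguration": { "Port": "8080" }
  },
  "AuthenticationConfiguration": {
    "AccessRoleArn": "arn:aws:iam::123456789012:role/AppRunnerECRAccessRole"
  },
  "AutoDeploymentsEnabled": true
}
EOF

# Steps 2-3 (hypothetical): set container resources (1 vCPU / 2 GB) and launch the service.
aws apprunner create-service \
  --service-name my-app-service \
  --source-configuration file://source-config.json \
  --instance-configuration Cpu=1024,Memory=2048
```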

**Pricing Model**
- Provisioned Instances: $0.007 / GB hour
- Active Instances: $0.064 / vCPU hour + $0.007 / GB hour
- CloudWatch Logs: $0.50 / GB
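
For rough intuition using the rates listed above: a single 1 vCPU / 2 GB instance that stays active around the clock costs about 730 h x $0.064/vCPU-h + 730 h x 2 GB x $0.007/GB-h, i.e. roughly $57 per month, before log storage.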

**Gotchas**
- VPC connectivity was not supported at launch (see the App Runner VPC support announcement linked below)
- Can't scale down to 0

**Demo Videos/Articles**
- https://www.youtube.com/watch?v=TKirecwhJ2c
- https://www.youtube.com/watch?v=SVfIdT38i9I
- https://www.youtube.com/watch?v=x_1X_4j16A4
- https://www.reddit.com/r/aws/comments/v1hax5/aws_app_runner_deep_dive_my_summary/
- https://aws.amazon.com/blogs/aws/new-for-app-runner-vpc-support/
- https://semaphoreci.com/blog/aws-app-runner


22 changes: 22 additions & 0 deletions AmazonWebServices/StoringVideoData.md

### Storing Video Data Efficiently


**Storing Video Data Itself**


**Storing Links to Video Data**


An example pre-signed S3 URL (returned by eyeson), broken onto separate lines to show its components:

- `https://s3.eyeson.com`
- `/meetings/62dadc5861b4b000107a87d3`
- `/62dadc5d2e5c55000f18fa1e?X-Amz-Algorithm=AWS4-HMAC-SHA256`
- `&X-Amz-Credential=AH07FBAZO9IYS38J6742%2F20220722%2Fus-east-1%2Fs3%2Faws4_request`
- `&X-Amz-Date=20220722T181016Z`
- `&X-Amz-Expires=3600`
- `&X-Amz-SignedHeaders=host`
- `&X-Amz-Signature=ed50a7593edb7da8aabf548be4b5b8551560dd23cbd57616351f325d1c347ba2`
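
If we store only the bucket and key ourselves, an expiring link like the one above can be regenerated on demand. A minimal sketch using the AWS CLI, with a made-up bucket and key:

```bash
# Hypothetical bucket/key: generate a pre-signed GET URL that expires in one hour
aws s3 presign s3://example-video-bucket/path/to/recording.mp4 --expires-in 3600
```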

Binary file removed Git_GitHub_Guide/Git GitHub Guide_20211216.pdf
72 changes: 72 additions & 0 deletions HighPerformanceComputeCluster/UsePatterns.md

## High Performance Computing Cluster (HPCC)

From [Wharton's Documentation](https://research-it.wharton.upenn.edu/documentation/):
> The Wharton School HPC Cluster is a 32-node, 512-core Linux cluster environment designed to support the school’s academic research mission. It is managed collaboratively by Wharton Computing’s Research and Innovation and Core Services teams.

## Use Patterns via the command line

**Standard Job / Job Arrays (qsub) and Head Node Interactive Development (qlogin)**
- Submitting job arrays and developing interactively on the HPCC head node do not incur any marginal cost
- For job configurations requiring GPUs, we can submit jobs at no marginal cost in the free tier, but the jobs will automatically shut down after 4 hours
- For each job submission, we can set the:
  - GPU count (`-l gpu`),
  - available memory (`-l m_mem_free`),
  - task count (`-t`), which sets the number of tasks for a job submitted as a job array (see the example submission below)
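A minimal sketch of what such a submission might look like on the command line. The exact resource names and values (in particular whether the GPU resource is requested as `gpu=1`) should be confirmed against Wharton's documentation; the flag values below are placeholders.

```bash
# Hypothetical job-array submission:
#   -l m_mem_free=8G   request 8 GB of memory per task
#   -l gpu=1           request one GPU (free tier: job stops after 4 hours)
#   -t 1-100           run demo.sh as a 100-task job array
qsub -l m_mem_free=8G -l gpu=1 -t 1-100 demo.sh

# Inside demo.sh, each array task can read its index from $SGE_TASK_ID.
```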

**Cloud Bursting Job / Job Array Submission (qsub-aws) & Interactive Development (qlogin-aws)**
- https://research-it.wharton.upenn.edu/documentation/cloud-bursting/
- We pay for hours of dedicated instance usage, at a 40% discount
- The AWS instances only run while jobs are submitted to and running in the dedicated queue, and they shut down as jobs complete. We pay only for the hours they are running jobs, which excludes start-up, shutdown, and any idle time; instances are automatically shut down after one idle billable hour to conserve budget
- Billing is sent out the following month, for the previous month of usage


**Monitoring Active Use**
- You will be able to see the current usage costs with the `unicloud-dbr` command on the hpcc-login node.


## General Use

These are general guidelines I use for day-to-day work on the HPCC.

**Logging Into HPCC Outside Penn Network: VPN**
- [guide from Wharton](https://support.wharton.upenn.edu/help/wharton-vpn#connecting-to-the-vpn)

**Using the Head Node via `qlogin`. This is for interactive development, not for large jobs**
- `ssh [email protected]`
- `qlogin -now no`
- `module load python/python-3.9.6` --> set the default Python version
- `module load gcc/gcc-11.1.0`
- `python -m venv myvenv` --> create a new virtual environment
- `source myvenv/bin/activate` --> activate it (given `myvenv` is your venv)
- `pip install -U pip`
- `pip install -U setuptools wheel`
- create a `requirements.txt` file for the virtual environment
- `pip install -r requirements.txt`

**Authenticating AWS Access:**
- Example command:
- `alias aws-login-pennmap="aws-federated-auth --account 088838630371 --user mriv;export AWS_PROFILE=aws-seas-wattslab-acct-PennAccountAdministrator"`

**Running Jupyter when connected via VPN**
- Ask Wharton Support to set up port forwarding on your HPCC environment, then follow these steps:
  - follow the general `qlogin` steps above
  - `jupyter lab &`
  - open another terminal window on your local computer and copy/paste the SSH tunnel command from the `jupyter lab` output (below is just an example, it won't work for you):
    - `ssh [email protected] -f -N -L 47686:hpcc019:47686`
  - then open a local browser and copy/paste the 127.0.0.1 URL from the `jupyter lab` output, similar (but not the same!) as the example below:
    - `http://127.0.0.1:47686/lab?token=41a0db54573edaf50e661b3a88894dbe28c4db9003f`


**Submitting Jobs:**
- [Wharton Best Practices](https://research-it.wharton.upenn.edu/documentation/programming-best-practices/)
- write module ("demo.py") that will execute upon being invoked via `python demo.py`
- write submission script, `demo.sh` that simply contains the `python demo.py` command
- submit to the cluster with `qsub demo.sh`
- output will be in the same directory as demo.sh
- check on submitted job status:
- qstat: displays the status of the HPCC queues, by default displaying your job information. It includes running and queued jobs
- [See Wharton Documentation](https://research-it.wharton.upenn.edu/documentation/job-management/)
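
A minimal sketch of the `demo.sh` pattern described above, assuming the virtual environment set up in the `qlogin` section; the job name, module version, and venv path are illustrative.

```bash
#!/bin/bash
# demo.sh -- minimal submission script sketch (module version and venv path are illustrative)
#$ -N demo_job        # job name shown in qstat
#$ -j y               # merge stdout and stderr into one output file

module load python/python-3.9.6
source "$HOME/myvenv/bin/activate"
python demo.py
```

Submit it with `qsub demo.sh` (or `qsub -t 1-10 demo.sh` to run it as a 10-task job array), then check progress with `qstat`.
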
4 changes: 4 additions & 0 deletions MechanicalTurk/LabAccountOrg.md

## CSSLab MTurk Account Organization


3 changes: 3 additions & 0 deletions MechanicalTurk/Panel.md

## The Panel

21 changes: 21 additions & 0 deletions MechanicalTurk/WorkerPaymentProcess.md

### Current Payment Notification Process: Deliberation-Empirica

- The following details the current payment process for Mechanical Turk (MTurk) workers drawn from the Panel who complete tasks for the Deliberation-Empirica project
- Deliberation-Empirica repository [here](https://github.com/Watts-Lab/deliberation-empirica)
- At a high level, the payment process is currently executed manually via a script found in the Turk-Interface [repository](https://github.com/Watts-Lab/deliberation-project/blob/main/survey_workflows/sending-notifications.R)
- The process has three steps and is designed to include intermediate checkpoints that let lab members review or intervene, as opposed to a fully automated payment system with no opportunity for human input


- Receive a file with payment specifications
  - This is simply a list of worker IDs and the associated bonus amounts to be distributed
- Update `sending-notifications.R` and run each of the following sections independently, as three steps:
  - Section 1 —> Organizing Payments Data, line 714:
    - primary function: create the `unpaid_people` dataframe, a table of worker IDs and associated payment amounts
  - Section 2 —> Registering Payments Data for Internal Review, line 744:
    - primary function: create and send a notification with the payment summary to the monitoring-panel Slack channel
    - this allows the team to review the payment summary before payments are actually distributed, in case something needs to be addressed
  - Section 3 —> Execute Payments, line 769:
    - primary function: execute payment of the workers listed in the `unpaid_people` table
    - involves authentication with the MTurk API and an MTurk payment token (a rough sketch of the underlying call follows below)
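
For context, the underlying MTurk operation in Section 3 is a per-worker bonus payment. A rough single-worker equivalent using the AWS CLI is sketched below; all IDs and amounts are placeholders, and the production process goes through `sending-notifications.R`, not this command. Note that the MTurk SendBonus call also requires the assignment ID of the worker's HIT, which is not part of the payment-specification file described above.

```bash
# Hypothetical single-worker bonus payment (placeholder IDs; MTurk is only available in us-east-1).
aws mturk send-bonus \
  --region us-east-1 \
  --worker-id A1EXAMPLEWORKERID \
  --assignment-id 3EXAMPLEASSIGNMENTID \
  --bonus-amount "5.00" \
  --reason "Bonus for Deliberation-Empirica session participation" \
  --unique-request-token payment-batch-001-A1EXAMPLEWORKERID
```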
