From 47f1b27a6fbe62045fe901cbee436be57f772aa3 Mon Sep 17 00:00:00 2001 From: chang48 Date: Mon, 1 Jul 2024 06:37:40 +0000 Subject: [PATCH] Website rebuild chang48/hugo@46cc51bcd49d0e3f68e07b0c0375fb7f96e67b51 --- about/index.html | 2 +- categories/forecast/index.html | 2 +- categories/study-notes/index.html | 2 +- index.html | 2 +- index.json | 2 +- posts/2021/11-05-orion/index.html | 2 +- posts/2021/gaussial-integral/index.html | 2 +- posts/2022/02-28-horsehead/index.html | 2 +- posts/2022/03-04-monkey/index.html | 2 +- posts/2022/05-30-elephans-trunk/index.html | 2 +- posts/2022/06-05-m13/index.html | 2 +- posts/2022/07-04-squid/index.html | 2 +- posts/2022/11-04-veil/index.html | 2 +- posts/2022/11-24-soul-nebula/index.html | 2 +- .../index.html | 2 +- .../mlops-week12-data-repositories/index.html | 2 +- .../mlops-week13-data-ingestion/index.html | 2 +- .../index.html | 2 +- .../mlops-week31-model-selection/index.html | 2 +- .../2022/mlops-week32-forecasting/index.html | 2 +- .../mlops-week33-computer-vision/index.html | 2 +- posts/2022/mlops-week4/index.html | 2 +- posts/2023/genki-lesson-13/index.html | 78 ++++++++++++++++++- posts/2024/06-28-forecasting-1/index.html | 2 +- posts/2024/06-30-forecasting-2/index.html | 20 ++++- posts/index.html | 2 +- tags/forecasting/index.html | 2 +- 27 files changed, 118 insertions(+), 30 deletions(-) diff --git a/about/index.html b/about/index.html index 71b8e9a..89169d6 100644 --- a/about/index.html +++ b/about/index.html @@ -4,7 +4,7 @@ -

About

·330 words·2 mins
Author
fermion

Welcome
#

Welcome. I use this blog to collect notes and some random stuff I do. The nickname fermion +

About

·330 words·2 mins
Author
fermion
Table of Contents

Welcome
#

Welcome. I use this blog to collect notes and some random stuff I do. The nickname fermion comes from a concept in quantum many-body physics. Fermions are a class of microscopic particles that are unique in the sense that no two fermions can have the same quantum numbers. At some level, we humans are also like fermions: each of us (or our state, to be more precise) is unique. diff --git a/categories/forecast/index.html b/categories/forecast/index.html index 04ab844..0c6fd7c 100644 --- a/categories/forecast/index.html +++ b/categories/forecast/index.html @@ -4,7 +4,7 @@ -

Forecast

2024

Forecasting: Chap. 2
·355 words·2 mins
Forecasting: Chap. 1
·534 words·3 mins

© +

Forecast

2024

Forecasting: Chap. 2
·562 words·3 mins
Forecasting: Chap. 1
·534 words·3 mins

© 2024 fermion

Powered by Hugo & Blowfish

\ No newline at end of file diff --git a/categories/study-notes/index.html b/categories/study-notes/index.html index 5d03107..8499999 100644 --- a/categories/study-notes/index.html +++ b/categories/study-notes/index.html @@ -4,7 +4,7 @@ -

Study Notes

2024

Forecasting: Chap. 2
·355 words·2 mins
Forecasting: Chap. 1
·534 words·3 mins

2022

Study notes: MLops Week 4
·518 words·3 mins
Study notes: MLops Week 3-3 Computer Vision
·292 words·2 mins
Study notes: MLops Week 3-2 Forecasting
·325 words·2 mins
Study notes: MLops Week 3-1 Machine Learning pipeline
·845 words·4 mins
Study notes: MLops Week 2 AWS ML Data Preparation
·414 words·2 mins
Study notes: MLops Week 1-3 Data Ingestion and Transformation
·329 words·2 mins
Study notes: MLops Week 1-2 Data Repositories
·475 words·3 mins
Study notes: MLops Week 1-1 AWS Machine Learning Technologies
·173 words·1 min

© +

Study Notes

2024

Forecasting: Chap. 2
·562 words·3 mins
Forecasting: Chap. 1
·534 words·3 mins

2022

Study notes: MLops Week 4
·518 words·3 mins
Study notes: MLops Week 3-3 Computer Vision
·292 words·2 mins
Study notes: MLops Week 3-2 Forecasting
·325 words·2 mins
Study notes: MLops Week 3-1 Machine Learning pipeline
·845 words·4 mins
Study notes: MLops Week 2 AWS ML Data Preparation
·414 words·2 mins
Study notes: MLops Week 1-3 Data Ingestion and Transformation
·329 words·2 mins
Study notes: MLops Week 1-2 Data Repositories
·475 words·3 mins
Study notes: MLops Week 1-1 AWS Machine Learning Technologies
·173 words·1 min

© 2024 fermion

Powered by Hugo & Blowfish

\ No newline at end of file diff --git a/index.html b/index.html index 81365bc..f7f45a2 100644 --- a/index.html +++ b/index.html @@ -4,7 +4,7 @@ -

Walking in the woods...

Recent

Forecasting: Chap. 2
·355 words·2 mins
Forecasting: Chap. 1
·534 words·3 mins
Soul Nebula
·95 words·1 min
Veil Nebula
·95 words·1 min
Squid Nebula
·325 words·2 mins
M13 Cluster
·88 words·1 min
IC1396 Elephant's Trunk Nebula
·143 words·1 min

© +

Walking in the woods...

Recent

Forecasting: Chap. 2
·562 words·3 mins
Forecasting: Chap. 1
·534 words·3 mins
Soul Nebula
·95 words·1 min
Veil Nebula
·95 words·1 min
Squid Nebula
·325 words·2 mins
M13 Cluster
·88 words·1 min
IC1396 Elephant's Trunk Nebula
·143 words·1 min

© 2024 fermion

Powered by Hugo & Blowfish

\ No newline at end of file diff --git a/index.json b/index.json index d787130..36966f7 100644 --- a/index.json +++ b/index.json @@ -1 +1 @@ -[{"content":" Welcome # Welcome. I use this blog to collect notes and some random stuff I do. The nickname fermion comes from a concept in quantum many-body physics. Fermions are a class of microscopic particles that are unique in the sense that no two fermions can have the same quantum numbers. At some level, we humans are also like fermions: each of us (or our state, to be more precise) is unique. Hence the nickname.\nWhat I do for a living # I\u0026rsquo;m currently a Research Data Specialist. Before that, I was a Data Scientist working on retail demand forecast and price optimization. Before that, I was a theoretical condensed matter physicist using quantum Monte Carlo methods to study properties of strongly correlated electron systems.\nThings I like to do when I\u0026rsquo;m not working # Jogging and badminton. Because my job requires a lot of screen time, I try to stay away from my computer when I am not at work. Jogging keeps me in shape, and badminton is a racket sport I have really liked since high school.\nFlute playing. I\u0026rsquo;m a flute enthusiast. I have been taking lessons since 2014. My favorite flutists include Peter-Lukas Graf, Emmanuel Pahud, Patrick Gallois, and Denis Bouriakov. In terms of flute music, I like those written by composers in the Baroque and Classical eras. J. S. Bach\u0026rsquo;s flute sonatas and Mozart\u0026rsquo;s flute quartets are my all-time favorites. Modern French composers like Debussy, Faure, and Gaubert also have some very sweet flute music.\nAstrophotography. I started doing astrophotography recently. Back when I was still a student, I was very interested in astrophotography but could not afford to do it because equipment such as the mount and telescope was expensive (it still is today). It was not until a few years ago that I was able to get into this hobby. The seeing in my neighborhood is pretty decent, so I am able to snap pretty nice pictures in my backyard. Here is my Astrobin page.\n","date":"20 August 2017","externalUrl":null,"permalink":"/about/","section":"Walking in the woods...","summary":"Welcome # Welcome.","title":"About","type":"page"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/categories/forecast/","section":"Categories","summary":"","title":"Forecast","type":"categories"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/tags/forecasting/","section":"Tags","summary":"","title":"Forecasting","type":"tags"},{"content":" Notes on Chapter 2 of Forecasting: Principles \u0026amp; Practice\nTime series patterns # Trend: Trend exists when there is a long-term increase or decrease in the data. The meaning of long or short is relative and depends on the data\u0026rsquo;s time scale. The trend does not need to be monotonic. It might go from an increasing trend to a decreasing one and vice versa. So long as the time scale of the tendency is larger than the data\u0026rsquo;s time scale, a trend can be identified.\nFor example, Fig. 1 shows the weekly passenger load of Ansett Airlines\u0026rsquo; economy class between 1986 and 1993. The data clearly indicates that there is an increasing trend, though bumpy, in the years 1986, 1990, and 1991 that lasted over several months.\nFig. 
1 Weekly economy passenger load on Ansett Airlines. Credit: Forecasting: Principles and Practice Fig. 2 summarizes the monthly sales of antidiabetic drugs in Australia between 1992 and 2008. Clearly, this example demonstrates that sales are trending upward in this time period.\nFig. 2 Monthly sales of antidiabetic drugs in Australia. Credit: Forecasting: Principles and Practice Seasonal: When a time series is affected by seasonal factors such as the time of the year or the day of the week, we say a seasonal pattern exists. Seasonality is always of a fixed and known period. For example, the drug sales data shown in Fig. 2 exhibits a strong yearly seasonality, and the pattern persists.\nCyclic: A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency. The emphasis on the non-fixed frequency makes it clear that cyclic patterns and seasonality are different concepts. The latter, as Fig. 2 illustrates, has a clear and well-defined period/frequency. The textbook also mentions that in general the length of a cycle is longer than the length of seasonal patterns. For example, Fig. 3 shows the sunspot activity between 1920 and 2020. The sunspot number has a roughly 11-year cycle. This period is not tied to any seasonal factor such as daily, weekly, or yearly cycles and is much longer. Fig. 3 International Sunspot number. Credit: SpaceWeatherLive.com ","date":"30 June 2024","externalUrl":null,"permalink":"/posts/2024/06-30-forecasting-2/","section":"Posts","summary":"Notes on Chapter 2 of Forecasting: Principles \u0026amp; Practice","title":"Forecasting: Chap. 2","type":"posts"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/categories/study-notes/","section":"Categories","summary":"","title":"Study Notes","type":"categories"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/","section":"Walking in the woods...","summary":"","title":"Walking in the woods...","type":"page"},{"content":" Notes on Chapter 1 of Forecasting: Principles \u0026amp; Practice\nWhat can be forecast # Factors that affect the predictability of an event or a quantity\nHow well we understand the factors that contribute to it; How much data is available; How similar the future is to the past; Whether the forecasts can affect the thing we are trying to forecast. Example 1: Short-term forecasts of residential electricity demand\nTemperature is a primary driving force of the demand, especially in summer. Historical electricity usage data is available. It\u0026rsquo;s safe to assume that the demand behavior will be similar to that in the past. I.e., there is some degree of seasonality. The price of electricity is not strongly dependent on the demand, so the forecast will not have much effect on consumer behavior. Example 2: Currency exchange rate\nWe have very limited knowledge about what really contributes to exchange rates. There is indeed a lot of historical exchange rate data available. Very difficult to say that the future will be similar to the past. The market is very sensitive to a number of unpredictable factors such as the political situation, a country\u0026rsquo;s financial stability and economic policies, \u0026hellip; etc. 
The exchange rate is bound to have a strong correlation with the forecast outcome, as the market will respond to any forecast results. This is called the efficient market hypothesis. Based on the predictability criterion, the exchange rate is likely not predictable. In fact, things like stock prices and lottery numbers fall in this category.\nDetermine what to forecast # It is necessary to consider the forecasting horizon. Will it be one month in advance, six months, or multiple years? Depending on the forecast horizon, different types of models will be necessary.\nForecasting data and methods # Qualitative forecasting: If there are no data available, or the data are not relevant to the forecast. Quantitative forecasting can be applied when: Numerical information about the past is available. It is reasonable to assume that some aspects of the past patterns will continue into the future. Forecasting models # Explanatory model: In this scenario, the historical behavior of a time series is assumed to be captured by the so-called predictor variables. For example, the hourly electricity demand \\(d\\) of a hot region during summer can be modeled by $$ d = F(\\text{temperature}, \\text{population}, \\text{time of day}, \\text{day of week}, \\text{error} ). $$\nThe relationship is not exact, but these variables are primary factors that are likely to impact the electricity demand. This type of model explains what causes the variation in electricity demand.\nTime series model: Electricity demand data form a time series. Hence a time series model can be used for forecasting. In this case, the demand \\(d_{t+1}\\) at time \\(t+1\\) is expressed as follows $$ d_{t+1} = F(d_t, d_{t-1}, d_{t-2}, \\ldots, \\text{error}), $$ where \\(t\\) represents the current hour, \\(t+1\\) is the next hour, \\(t-1\\) is the previous hour, and so on. The prediction of the future is based on past values of a variable but not on external variables that may affect the system.\nMixed models: The combination of the above two models $$ d_{t+1} = F( d_t, \\text{temperature}, \\text{population}, \\text{time of day}, \\ldots, \\text{error}), $$\n","date":"26 June 2024","externalUrl":null,"permalink":"/posts/2024/06-28-forecasting-1/","section":"Posts","summary":"Notes on Chapter 1 of Forecasting: Principles \u0026amp; Practice","title":"Forecasting: Chap. 1","type":"posts"},{"content":"","date":"24 November 2022","externalUrl":null,"permalink":"/categories/astrophotography/","section":"Categories","summary":"","title":"Astrophotography","type":"categories"},{"content":"","date":"24 November 2022","externalUrl":null,"permalink":"/tags/nebula/","section":"Tags","summary":"","title":"Nebula","type":"tags"},{"content":"Westerhout 5, known to most as the Soul Nebula, is an emission nebula located in Cassiopeia. This is a large star forming region like the Orion Nebula. 
The dark hollow areas embedded in the blue regions are cavities that were carved out by radiation and winds from the region\u0026rsquo;s massive stars.\nIntegrated image # Telescope: Takahashi FSQ-85EDX Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: Astronomik H-alpha 6nm 1.25\u0026quot;, SII 6nm 1.25\u0026quot;, OIII 6nm 1.25\u0026quot; filters Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 28h 25' Check out the photo on my Astrobin\n","date":"24 November 2022","externalUrl":null,"permalink":"/posts/2022/11-24-soul-nebula/","section":"Posts","summary":"\u003cp\u003eWesterhout 5, known to most as the Soul Nebula, is an emission nebula located in Cassiopeia.\nThis is a large star forming region like the Orion Nebula. The dark hollow areas embedded\nin the blue regions are cavities that were carved out by radiation and winds from the region\u0026rsquo;s\nmassive stars.\u003c/p\u003e","title":"Soul Nebula","type":"posts"},{"content":"The Veil Nebula. This is a 2x3 Mosaic narrow-band imaging project that took me almost two months (9/4/2022 ~ 11/1/2022) to complete. A total of 2155 subs with 179 hours of accumulated exposure time. This nebula is a magnificent supernova remnant in the area of Cygnus, one of my favorite constellations.\nIntegrated image # Telescope: Takahashi FSQ-85EDX Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: Astronomik H-alpha 6nm 1.25\u0026quot;, SII 6nm 1.25\u0026quot;, OIII 6nm 1.25\u0026quot; filters Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 179h 35' Check out the photo on my Astrobin\n","date":"4 November 2022","externalUrl":null,"permalink":"/posts/2022/11-04-veil/","section":"Posts","summary":"\u003cp\u003eThe Veil Nebula. This is a 2x3 Mosaic narrow-band imaging project that took me almost two months (9/4/2022 ~ 11/1/2022)\nto complete. A total of 2155 subs with 179 hours of accumulated exposure time. This nebula is a magnificent supernova\nremnant in the area of Cygnus, one of my favorite constellations.\u003c/p\u003e","title":"Veil Nebula","type":"posts"},{"content":"SH2-129 Flying Bat (red) and OU-4 Squid Nebulae. The combo can be found in the constellation Cepheus. Both are in the family of emission nebulae. The Flying Bat consists of mostly ionized hydrogen gas while the Squid is a region full of ionized oxygen clouds. Due to its extreme dimness, the Squid was discovered only recently, in 2011, by French astrophotographer Nicolas Outters. The formation of the Squid is still under debate, but many astronomers believe the bright star HD202214 (near the center of the photo) plays a major role.\nTwo years into the hobby, I decided to take on the challenge. And, not surprisingly, this is the most demanding project by far. The total integration time is 72 hours. I knew the squid is an extremely faint emission nebula before I started. However, it\u0026rsquo;s more difficult than I expected to get enough photons even in a single 10 min sub with an f/5.3 scope under Bortle 5/6 skies. Luckily, the weather in Northern California was almost perfect in the past 20 days or so during the project. Cloud coverage was constantly around or below 6-8%, the worst being 12-15%.\nApart from a guiding issue I still need to sort out, imaging the objects was pretty smooth. Processing the data, however, is another story. For my setup, the signal for the squid is not much above the background noise level. Stretching the squid without blowing up the noise was really pushing my processing skills to the limit. 
I ended up using a mask in order to suppress the noise in the nonlinear stage. All in all, I\u0026rsquo;ve learned a lot from the project. Hope you will like the results.\nIntegrated image # Telescope: Takahashi FSQ-85EDX Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: Astronomik Deep-Sky RGB 31mm + Astronomik H-alpha 6nm 1.25\u0026quot; + OIII 6nm 1.25\u0026quot; Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 71h 50' Check out the photo on my Astrobin\n","date":"4 July 2022","externalUrl":null,"permalink":"/posts/2022/07-04-squid/","section":"Posts","summary":"\u003cp\u003eSH2-129 Flying Bat (red) and OU-4 Squid Nebulae. The combo can be found in the constellation Cepheus.\nBoth are in the family of emission nebulae. The Flying Bat consists of mostly ionized hydrogen gas while\nthe Squid is a region full of ionized oxygen clouds. Due to its extreme dimness, the Squid was\ndiscovered only recently, in 2011, by French astrophotographer Nicolas Outters. The formation of\nthe Squid is still under debate, but many astronomers believe the bright star HD202214 (near the center\nof the photo) plays a major role.\u003c/p\u003e","title":"Squid Nebula","type":"posts"},{"content":"M13, the Great Globular Cluster. Located in the constellation of Hercules, it is roughly 27,100 light years away from Earth. The cluster is the brightest globular cluster in the northern hemisphere, visible to the naked eye under dark skies. My first astrophotography image taken with the Takahashi FSQ-85EDX, also called the Baby-Q.\nIntegrated image # Telescope: Takahashi FSQ-85EDX Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: ZWO LRGB 31mm Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 3h 40' Check out the photo on my Astrobin\n","date":"5 June 2022","externalUrl":null,"permalink":"/posts/2022/06-05-m13/","section":"Posts","summary":"\u003cp\u003eM13, the Great Globular Cluster. Located in the constellation of Hercules, it is roughly 27,100 light\nyears away from Earth. The cluster is the brightest globular cluster in the northern hemisphere, visible\nto the naked eye under dark skies. My first astrophotography image taken with the Takahashi FSQ-85EDX,\nalso called the \u003cstrong\u003eBaby-Q\u003c/strong\u003e.\u003c/p\u003e","title":"M13 Cluster","type":"posts"},{"content":"","date":"5 June 2022","externalUrl":null,"permalink":"/tags/star-cluster/","section":"Tags","summary":"","title":"Star Cluster","type":"tags"},{"content":"Elephant\u0026rsquo;s Trunk and the IC1396. Standard SHO narrow band image. Elephant\u0026rsquo;s trunk (the dark cloud on the left) is an area of interstellar gas and dust in the large ionized gas region IC1396. The area is located in the constellation Cepheus, about 2400 light years away from Earth.\nThe entire area is very dynamic. The IC1396 is being illuminated and ionized by the bright and massive star HD 206267 seen near the center of the frame. 
The Elephant\u0026rsquo;s Trunk region is currently thought to be a star formation site containing many young stars (less than 100,000 years old).\nIntegrated image # Telescope: Takahashi FSQ-85EDX Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: Astronomik H-alpha 6nm 1.25\u0026quot; + SII 6nm 1.25\u0026quot; + OIII 6nm 1.25\u0026quot; Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 15h 35' Check out the photo on my Astrobin\n","date":"30 May 2022","externalUrl":null,"permalink":"/posts/2022/05-30-elephans-trunk/","section":"Posts","summary":"\u003cp\u003eElephant\u0026rsquo;s Trunk and the IC1396. Standard SHO narrow band image. Elephant\u0026rsquo;s trunk (the dark cloud on the left) is an area of interstellar gas and dust in the large ionized gas region IC1396. The area is located in the constellation Cepheus, about 2400 light years away from Earth.\u003c/p\u003e\n\u003cp\u003eThe entire area is very dynamic. The IC1396 is being illuminated and ionized by the bright and massive star HD 206267 seen near the center of the frame. The Elephant\u0026rsquo;s Trunk region is currently thought to be a star formation site containing many young stars (less than 100,000 years old).\u003c/p\u003e","title":"IC1396 Elephant's Trunk Nebula","type":"posts"},{"content":"","date":"9 April 2022","externalUrl":null,"permalink":"/tags/data-science/","section":"Tags","summary":"","title":"Data Science","type":"tags"},{"content":"","date":"9 April 2022","externalUrl":null,"permalink":"/tags/mlops/","section":"Tags","summary":"","title":"MLops","type":"tags"},{"content":"Quick notes of AWS MLops, Week 4\nTalk about availability, scalability, and resilience # Monitoring and Logging\nViewpoint: data science for software systems AWS CloudWatch Dashboards Search Alerts Automated insights Use CloudWatch to pull in info from servers hosting source code and monitoring agents (e.g. CPU metrics, Memory metrics, and Disk I/O metrics) Multiple Regions\nResources are distributed across isolated geographic regions which have multiple availability zones Create as many redundant infrastructures as needed Increase resilience Reproducible Workflows\nInfrastructure as code (IaC) workflow: the idea behind infrastructure as code is that there isn\u0026rsquo;t a human pushing buttons to make something happen. Builds are triggered by events. Implement Appropriate ML Services # Trade-off between higher infrastructure control and faster application development and deployment Provisioning EC2\nCan launch EC2 from console, SDK, or CLI (a boto3 sketch follows below). Sub-components: User data: could put special instructions here Storage: EBS versus Instance Security Group: firewall rules for the EC2 launch SSH Key pair Have an Amazon Machine Image (AMI)? Instance type: CPU vs. GPU Cost: On demand vs. Spot Virtual Private Cloud (VPC) IAM Role Provisioning EBS, Elastic Beanstalk, various possibilities of building on top of the AWS platform\nKey idea: Elastic Beanstalk can scale resources up and down automatically according to the health metrics from the load balancer. The provisioning model is elastic. Block storage can be provisioned to have high bandwidth The user decides which parts should be pre-provisioned and which parts should be elastic. Example: You need extremely high-bandwidth storage for machine learning training where a cluster of machines all talk to the same mount point. 
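The EC2 provisioning checklist above maps onto a single SDK call. Here is a minimal boto3 sketch of launching one instance; the AMI ID, key pair name, and security group ID are hypothetical placeholders rather than values from the original notes.

```python
import boto3

# Minimal sketch: launch one EC2 instance using the sub-components listed above.
# ImageId, KeyName, and SecurityGroupIds are hypothetical placeholders --
# substitute real values from your own account and region.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # Amazon Machine Image (AMI)
    InstanceType="t3.micro",                    # CPU instance; pick a GPU type for training
    MinCount=1,
    MaxCount=1,
    KeyName="my-ssh-key",                       # existing SSH key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],  # firewall rules for the launch
    UserData="#!/bin/bash\nyum -y update",      # special instructions run at first boot
)
print(response["Instances"][0]["InstanceId"])
```

The parameters correspond one-to-one to the console launch wizard's fields, which is what makes this flow straightforward to capture as infrastructure as code.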
AWS ML Services\nMany high-level ML services provided Provides both GUI/Console access and programmatic access through boto3 Examples and tutorials boto3 documentation has all API call details Deploying and Securing ML Solutions # Principle of Least Privilege AWS Lambda\nConfigure the Lambda microservice to have the minimal privileges necessary for upstream (e.g. AWS S3) and downstream (e.g. DynamoDB) access. Can be achieved through IAM role-based policies Integrated Security\nAWS security firewall, blocking incoming ports via role-based privileges Within the firewall, data transfer is encrypted between source and AWS S3 object storage Everything is inside a virtual private cloud (AWS VPC) Use automated deployment or infrastructure as code, no need to worry about making manual mistakes Audit: AWS CloudTrail monitors API calls and all actions that are occurring in the network Use AWS SageMaker Studio to prepare data, build models, train \u0026amp; tune models, and deploy. This platform provides launchers that have many models and templates for jump-starting any project\nAWS SageMaker Canvas: Using Canvas is a great way to understand, at a high level, the kinds of machine learning problems being solved; you can also build your own machine learning system with SageMaker or build a system outside it using AWS Cloud9.\nData Drift and Model Monitoring:\nUse data to train the first model New data comes in, triggers a data drift alert saying a new model is needed New data is combined with the old one and a new model is trained, registered, and deployed ","date":"9 April 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week4/","section":"Posts","summary":"\u003cp\u003eQuick notes of AWS MLops, Week 4\u003c/p\u003e","title":"Study notes: MLops Week 4","type":"posts"},{"content":"Week 3-3 of the AWS MLops: Computer vision and AWS Rekognition\nComputer Vision and Amazon Rekognition # Computer vision # Automated extraction of information from digital images Applications Public safety and home security Authentication and enhanced computer-human interaction Content management and analysis Autonomous driving Medical imaging Manufacturing process control Computer vision problems: Image analysis Object classification Object detection Object segmentation Video analysis Instance tracking, pathing Action recognition Motion estimation Amazon Rekognition # Managed service for image and video analysis\nTypes of analysis:\nSearchable image and video libraries Face-based user verification Sentiment and demographic analysis Unsafe content detection Can add powerful visual analysis to applications\nHighly scalable and continuously learns\nIntegrates with other AWS services\nExamples:\nSearchable image library Image moderation Sentiment analysis AWS services used in these examples: S3 Lambda Rekognition Elasticsearch Service Kinesis Video Streams Kinesis Data Streams Redshift QuickSight Custom Labels # Example use cases Search logos Identify products Identify machine parts Distinguish between healthy and infected plants Almost all vision solutions start with an existing model Custom labeling process Collect images Collect a few hundred images Build domain-specific models 10 PNG or JPEG images per label Use images similar to the images that you want to detect Create dataset Images, labels, and bounding boxes Need at least two labels Label images by using the console or Amazon SageMaker Ground Truth Model evaluation Precision, recall Overall model 
performance Improve the model Better and more data Reduce false positives (better precision): could add more classes as labels for training Reduce false negatives (better recall): use better data or more precise classes (labels) for training Adjust the confidence threshold to tune precision/recall Use model Apply the model on new images and collect custom labels: label, object bounding box, and confidence level ","date":"3 April 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week33-computer-vision/","section":"Posts","summary":"\u003cp\u003eWeek 3-3 of the AWS MLops: Computer vision and AWS Rekognition\u003c/p\u003e","title":"Study notes: MLops Week 3-3 Computer Vision","type":"posts"},{"content":"Week 3-2 of the AWS MLops: AWS Forecast and time-series data\nForecasting and AWS Forecast # Overview # Predicting future values that are based on historical data Patterns include Trends Seasonal, pattern that is based on seasons Cyclical, other repeating patterns Irregular, patterns that might appear to be random Examples Sales and demand forecast Energy consumption Inventory projections Weather forecast Processing time series data # Time series data is captured in sequence over time\nHandle missing data\nForward fill Backward fill Moving average Interpolation: linear, spline, or polynomial Sometimes zero is a good fill value Resampling: Resampling time series data allows the flexibility of defining the resolution of the data\nUpsampling: increase the sample frequency, e.g. from minutes to seconds. Care must be taken in deciding how the fine-grained samples are computed. Downsampling: decrease the sample frequency, e.g. from days to months. Need to pay attention to how the aggregation is carried out. Reasons for resampling: Inspect the behavior of data under different resolutions Join tables with different resolutions Smoothing, including outlier removal\nWhy Part of the data preparation process For visualization How does smoothing affect the outcome Cleaner data to model Model compatibility Production improvement? Seasonality\nHourly, daily, quarterly, yearly Spring, summer, fall, winter Holidays Time series sample correlations\nStationarity How stable is the system Does the past inform the future Trends Correlation issues Autocorrelation How points in a time series sample are linearly related pandas offers many methods for handling time series data\nTime-aware index groupby and resample() autocorr() method Time series algorithms offered by Amazon Forecast\nARIMA, autoregressive integrated moving average DeepAR+ Exponential Smoothing (ETS) Non-Parametric Time Series (NPTS) Prophet Model evaluation\nTime series data model training cannot use $k$-fold cross validation because the data is ordered and correlated. Standard approach: backtesting Two metrics can be used to assess the backtesting (hindcasting instead of forecasting) accuracy wQuantileLoss: the average error for each quantile in a set RMSE, root mean square error ","date":"28 March 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week32-forecasting/","section":"Posts","summary":"\u003cp\u003eWeek 3-2 of the AWS MLops: AWS Forecast and time-series data\u003c/p\u003e","title":"Study notes: MLops Week 3-2 Forecasting","type":"posts"},{"content":"Week 3-1 of the AWS MLops: ML steps, pipeline, and AWS SageMaker\nMachine Learning Pipeline and AWS SageMaker # Forming the business problem # Define the business objective, questions to ask:\nHow is this task done today? How will the business measure success? 
How will the solution be used? Do similar solutions exist? What are the assumptions? Who are the domain experts? Frame the business problem\nIs it a machine learning problem? What kind of ML problem is it? Classification or regression? Is the problem supervised or unsupervised? What is the target to predict? Do you have access to the relevant data? What is the minimum or baseline performance? Would you solve the problem manually? What is the simplest solution? Can the success be measured? Collect and secure data, ETL # Data sources\nPrivate data Commercial data: AWS Data Exchange, AWS Marketplace, $\\ldots$ Open-source data Kaggle World Health Organization US Census Bureau National Oceanic and Atmospheric Administration UC Irvine Machine Learning repository AWS Critical question: Is your data representative? ETL with AWS Glue\nRuns the ETL process Crawls data sources to create catalogs that can be queried ML functionality AWS Glue can glue together different datasets and emit a single endpoint that can be queried Data security: Access control and Data encryption\nControl access using AWS Identity and Access Management (IAM) policy AWS S3 default encryption AWS RDS encryption AWS CloudTrail: tracks user activity and application programming interface (API) usage Data evaluation # Make sure the data is in the correct format Use descriptive statistics to gain insights into the dataset before cleaning the data Overall statistics Categorical statistics can identify frequency of values and class imbalances Multivariate statistics Scatter plot to inspect the correlation between two variables pandas provides the scatter_matrix method to examine multivariate correlations Correlation matrix and heat map Attribute statistics Feature engineering # Feature extraction\nData encoding\nCategorical data must be converted to a numeric scale If data is non-ordinal, the encoded value must be non-ordinal, which might need to be broken into multiple categories Data cleaning\nVariations in strings: text standardization Variations in scale: scale normalization Columns with multiple data items: parse into multiple columns Missing data: Causes of missing data: undefined values data collection errors data cleaning errors Plan for missing data: ask the following questions first What were the mechanisms causing the missing data? Are values missing at random? Are rows or columns missing that you are not aware of? Standard approaches Dropping missing data Imputing missing data Outliers Finding the outliers: box plots or scatter plots for visualization Dealing with outliers: Delete - e.g. outliers were created by artificial errors Transform - reduce the variation Impute - e.g. use the mean for the outliers Feature selection\nFilter method\nPearson\u0026rsquo;s correlation Linear discriminant analysis (LDA) Analysis of variance (ANOVA) Chi-square $\\chi^2$ analysis Wrapper method\nForward selection Backward selection Embedded method\nDecision trees LASSO and RIDGE Model training and evaluation # Holdout method\nSplit the data into training and test sets The model is trained on the training set. Afterwards, its performance is evaluated by testing the model on the test set data which the model has never touched. Advantage: straightforward to implement and computationally cheap because training and testing are carried out once each. Disadvantage: It could happen that the test set and the training set have different statistical distributions, i.e. 
the test set data cannot faithfully represent the training set distribution. In this case, the validation result is likely not accurate. If we tune the model based on a single test set, we may end up overfitting the test data set. While this approach can be improved by using training, validation, and test sets, the result might still depend on the way the data sets are prepared, leading to some degree of bias. The $k$-fold cross-validation method, an evaluation method that minimizes the disadvantages of the holdout method:\nDivide the whole data set into training and test sets. Shuffle the training set randomly, if possible.1 Split the training set into $k$ non-overlapping subsets (folds) that are equally partitioned, if possible. For each of the $k$ folds: Train a new model on the $k-1$ folds and validate using the remaining fold. Retain the evaluation score and discard the model. The performance metric is obtained by averaging the $k$ evaluation scores. The test set is used for final evaluation. To avoid data leakage, any feature engineering should be carried out separately for training and validation inside the CV loop. Reference for practical advice on cross-validation, including imbalanced data sets Evaluation\nClassification problems Confusion matrix F1 score, the harmonic mean of precision and sensitivity AUC-ROC Regression Mean squared error Model tuning # Amazon SageMaker offers automated hyperparameter tuning Best practices Don\u0026rsquo;t adjust every hyperparameter Limit the range of values to what\u0026rsquo;s most effective Run one training job at a time instead of multiple jobs in parallel In distributed training jobs, make sure that the objective metric that you want is the one that is reported back With Amazon SageMaker, convert log-scaled hyperparameters to linear-scaled when possible 1: Time-series data is ordered and can\u0026rsquo;t be shuffled.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"20 March 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week31-model-selection/","section":"Posts","summary":"\u003cp\u003eWeek 3-1 of the AWS MLops: ML steps, pipeline, and AWS SageMaker\u003c/p\u003e","title":"Study notes: MLops Week 3-1 Machine Learning pipeline","type":"posts"},{"content":"Week 2 of the AWS MLops: Data preparation using AWS services\nMachine Learning (ML) and AWS ML Services # Deep learning $\\subset$ Machine learning $\\subset$ Artificial intelligence\nML is the scientific study of algorithms and statistical models to perform a task using inference instead of instructions\nTypical workflow: Data → Model → Prediction\nTypes of ML algorithms and common business use cases\nSupervised learning: Fraud detection Image recognition - computer vision Customer retention Medical diagnostics - computer vision Personalized advertising Sales prediction Weather forecasting Market projection Population growth prediction $\\ldots$ Unsupervised learning: Product recommendations Customer segmentation Targeted marketing Medical diagnostics Natural language processing - chatbot, translation, sentiment analysis Data structure discovery Gene sequencing $\\ldots$ Reinforcement learning: best when the desired outcome is known but the exact path to achieving it is not known Game AI Self-driving cars Robotics Customer service routing $\\ldots$ Use ML when you have:\nLarge datasets, large number of variables Lack of clear procedures to obtain solutions Existing ML expertise Infrastructure already in place to support ML Management support for ML Typical ML 
workflow Iterative process Data processing Training Evaluation ML frameworks and infrastructure\nFrameworks provide tools and code libraries Customized scripting Integration with AWS services Community of developers Example: PyTorch, TensorFlow, scikit-learn, $\\ldots$ Infrastructure Designed for ML applications AWS IoT Greengrass provides an infrastructure for building ML for IoT devices AWS Elastic Inference reduces costs for running ML apps AWS ML managed services, no ML experience required\nComputer vision: Amazon Rekognition, Amazon Textract Speech: Amazon Polly, Amazon Transcribe Language: Amazon Comprehend, Amazon Translate Chatbots: Amazon Lex Forecasting: Amazon Forecast Recommendations: Amazon Personalize Three layers of the Amazon Machine Learning stack:\nManaged Services Machine Learning Services Machine Learning Frameworks ML challenges\nData Poor quality Non-representative Insufficient Overfitting and underfitting Business Complexity in formulating questions Explaining models to business stakeholders Cost of building systems Users Lack of data science expertise Cost of staffing with data scientists Lack of management support Technology Data privacy issues Tool selection can be complicated Integration with other systems Feature Engineering # Public datasets for feature engineering and model tuning\nHugging Face public datasets Kaggle public datasets Amazon S3 buckets A useful concept: combine old features to produce new features for training/validation\nHelpful to create an ML project structure so that the project can be managed and tracked phase by phase\nData ingest Exploratory data analysis (EDA) Modeling Conclusion At the EDA phase, typical approaches\nLook at descriptive statistics Graphing data, examining trends: linear, logarithmic, $\\ldots$ Clustering data ","date":"14 March 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week2-aws-ml-data-preparation/","section":"Posts","summary":"\u003cp\u003eWeek 2 of the AWS MLops: Data preparation using AWS services\u003c/p\u003e","title":"Study notes: MLops Week 2 AWS ML Data Preparation","type":"posts"},{"content":"Week 1-3 of the AWS MLops: Data ingestion and AWS jobs\nAWS job styles: # Batch Glue: creates metadata that allows you to perform operations on e.g. S3 or a database. This is a serverless ETL system. Batch: general purpose batch, can process anything at scale in containers and train models with GPUs. Step functions: parameterize different steps, orchestrate Lambda functions with inputs. Streaming Kinesis: send in small payloads and process them as it receives the payloads. Kafka via Amazon MSK In terms of operation complexity and data size, here is a high-level comparison\nBatch Streaming complexity simple complex data size large small Complexity: Batch jobs are simpler: they receive data, execute operations across it, then give back results. Streaming jobs, on the other hand, need to take in data as they come in and are a bit more prone to errors and mistakes. Data size: Batch jobs are good at handling large data payloads since they are designed to process in batch, while streaming jobs process things as they come in. Batch: data ingestion and processing pipelines # Examples # Example 1 - AWS Batch: Event trigger creates new jobs New jobs are stored in a queue. Can have thousands of jobs. Each job launches its own container and performs things like fine-tuning Hugging Face models using GPUs. Example 2 - AWS Step Function: Event trigger First step, a Lambda function, gets a JSON payload and exports results. 
Second step, also a Lambda function, takes outputs from the previous step as inputs. Exports results as JSON. Example 3 - AWS Glue, an ETL pipeline: Event trigger AWS Glue points to multiple data sources: CSV files in S3 or an external PostgreSQL database. Glue ties multiple data sources together and creates an ETL job, then transforms the data and puts it into an S3 bucket. Glue creates a data catalog that can be queried via AWS Athena without having to actually pull all the data out of S3 for data visualization and maybe manipulation. ","date":"9 March 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week13-data-ingestion/","section":"Posts","summary":"\u003cp\u003eWeek 1-3 of the AWS MLops: Data ingestion and AWS jobs\u003c/p\u003e","title":"Study notes: MLops Week 1-3 Data Ingestion and Transformation","type":"posts"},{"content":"SH2-252 Monkey Head Nebula. Located in the Orion Constellation. Embedded in the nebula is an open cluster, NGC 2175. The cluster is about 6350 light years from Earth. The nebula has active ionized oxygen, sulfur, and hydrogen clouds. A very good narrowband subject. Does the nebula look like a monkey head? I can\u0026rsquo;t really tell\u0026hellip;😂\nIntegrated image # Telescope: Takahashi TSA-120N Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: Astronomik H-alpha 6nm 1.25\u0026quot; + SII 6nm 1.25\u0026quot; + OIII 6nm 1.25\u0026quot; Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 9h Check out the photo on my Astrobin\n","date":"4 March 2022","externalUrl":null,"permalink":"/posts/2022/03-04-monkey/","section":"Posts","summary":"\u003cp\u003eSH2-252 Monkey Head Nebula. Located in the Orion Constellation. Embedded in the nebula is an open\ncluster, NGC 2175. The cluster is about 6350 light years from Earth. The nebula has active ionized oxygen,\nsulfur, and hydrogen clouds. A very good narrowband subject. Does the nebula look like a monkey head?\nI can\u0026rsquo;t really tell\u0026hellip;😂\u003c/p\u003e","title":"NGC2175, Monkey Head Nebula","type":"posts"},{"content":"IC434 Horsehead Nebula. This nebula is probably one of the most imaged subjects in astrophotography. IC434 is a dark nebula in the constellation Orion. It is about 1375 light years away from Earth. The nebula got its name because the dark dust cloud looks just like a horse\u0026rsquo;s head. Behind the Horsehead nebula is a huge area of ionized hydrogen gas lit up by the star group Sigma Orionis (at the top of the image)\nIntegrated image # Telescope: Takahashi TSA-120N + TOA-35 Reducer Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: ZWO RGB 31mm + Astronomik H-alpha 6nm 1.25\u0026quot; Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 20h 45' Check out the photo on my Astrobin\n","date":"28 February 2022","externalUrl":null,"permalink":"/posts/2022/02-28-horsehead/","section":"Posts","summary":"\u003cp\u003eIC434 Horsehead Nebula. This nebula is probably one of the most imaged subjects in astrophotography.\nIC434 is a dark nebula in the constellation Orion. It is about 1375 light years away from Earth.\nThe nebula got its name because the dark dust cloud looks just like a horse\u0026rsquo;s head. 
Behind the\nHorsehead nebula is a huge area of ionized hydrogen gas lit up by the star group Sigma Orionis\n(at the top of the image)\u003c/p\u003e","title":"Horsehead Nebula","type":"posts"},{"content":"Week 1-2 of the AWS MLops: Data repositories and AWS managed storage choices\nIntroduction to AWS S3 # Object storage service that offers scalability, data availability, security, and performance.\n99.999999999% (11 nines) of durability Easy-to-use management features Can respond to event triggers Use cases:\nContent storage/distribution Backup, restore, and archive Data lakes and big data analytics Disaster recovery Static website hosting Components:\nBucket: https://s3-\u0026lt;aws-region\u0026gt;.amazonaws.com/\u0026lt;bucket-name\u0026gt;/ Object: https://s3-\u0026lt;aws-region\u0026gt;.amazonaws.com/\u0026lt;bucket-name\u0026gt;/\u0026lt;object-key\u0026gt; Objects in an S3 bucket can be referred to by their URL The key value identifies the object in the bucket Prefixes:\nUse prefixes to imply a folder structure in an S3 bucket Specify prefix: 2021/doc-example-bucket/math Returns the following keys: 2021/doc-example-bucket/math/john.txt 2021/doc-example-bucket/math/maris.txt Object metadata:\nSystem-defined: object creation date object size object version User-defined: information that you assign to the object x-amz-meta key followed by a custom name. Example: x-amz-meta-alt-name Versioning:\nKeep multiple variants of an object in the same bucket In versioning-enabled S3 buckets, each object has a version ID After versioning is enabled, it can only be suspended. Three operations:\nPUT: Upload entire object to a bucket. Max size: 5 GB Should use multipart upload for objects over 100 MB import boto3 S3API = boto3.client(\u0026#34;s3\u0026#34;, region_name=\u0026#34;us-east-1\u0026#34;) bucket_name = \u0026#34;samplebucket\u0026#34; filename = \u0026#34;/resources/website/core.css\u0026#34; S3API.upload_file( filename, bucket_name, \u0026#34;core.css\u0026#34;, ExtraArgs={\u0026#39;ContentType\u0026#39;: \u0026#34;text/css\u0026#34;, \u0026#34;CacheControl\u0026#34;: \u0026#34;max-age=0\u0026#34;}) GET: Used to retrieve objects from Amazon S3 Can retrieve the complete object at once or a range of bytes DELETE: Versioning disabled - object is permanently deleted from the bucket Versioning enabled - delete with key and version ID S3 SELECT:\nA powerful tool to query data in place without the need to fetch the data from buckets. Data encryption: S3 has two types of policies for bucket access:\nACLs: access control lists. Resource-based access policy to manage access at the object level or bucket level. Data storage # As the first step, catalog all of the different data sources in the organization into a master list. Once the master list is created, develop a strategy around how to process the data in a data engineering pipeline\nDetermine the correct storage medium\nDatabase Key/value database, e.g. DynamoDB, ideal for user records or game stats Graph database, e.g. Neptune, for relationship building SQL, e.g. Amazon Aurora, RDS, for transaction-based queries Data lake: Built on top of S3 Metadata + Storage + Compute Can index things that are inside S3 EFS Elastic File System Amazon EFS is a cloud-based file storage service for apps and workloads that run in the AWS public cloud Automatically grows and shrinks as you add and remove files The system manages the storage size automatically without any provisioning EBS stands for Elastic Block Store. 
This is a high-performance block-storage service designed for AWS Elastic Compute Cloud (AWS EC2) Offers a very fast file system, ideal for machine learning training that requires fast file I/O MLOps Template Github # Great templates for MLops projects with GPU:\nhttps://github.com/nogibjj/mlops-template\n","date":"26 February 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week12-data-repositories/","section":"Posts","summary":"\u003cp\u003eWeek 1-2 of the AWS MLops: Data repositories and AWS managed storage choices\u003c/p\u003e","title":"Study notes: MLops Week 1-2 Data Repositories","type":"posts"},{"content":"Week 1-1 of the AWS MLops: Introduction\nAWS Sagemaker Studio Lab # Free environment for prototyping machine learning projects Based on JupyterLab Supports two compute types: CPU and GPU Good integration with Hugging Face Offers a terminal window, allowing access to AWS resources through the aws command AWS CloudShell # No credentials to manage because of the role-based privileges\nConvenient file upload/download GUI\nEasy access to AWS resources such as S3 via the aws command\nFile transfer between CloudShell and S3 buckets File synchronization between CloudShell and S3 buckets Cloud Developer Workspace\nVarious vendors: GitHub Codespaces - Easy integration with GitHub services AWS Cloud9, AWS CloudShell is a lightweight version of Cloud9 GCP Cloud IDE Azure Cloud IDE Advantages compared with a traditional Laptop/Workstation Powerful Disposable Preloaded Notebook-based: GPU + Jupyter Notebook AWS Sagemaker Studio Lab Google Colab Notebooks AWS has pre-built machine learning applications that can be accessed directly in CloudShell\nAdvanced text analytics Automated code reviews Chatbots Demand forecasting Document analysis Search Fraud prevention Image and video analysis \u0026hellip; ","date":"16 February 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week11-aws-ml-technologies/","section":"Posts","summary":"\u003cp\u003eWeek 1-1 of the AWS MLops: Introduction\u003c/p\u003e","title":"Study notes: MLops Week 1-1 AWS Machine Learning Technologies","type":"posts"},{"content":"","date":"25 December 2021","externalUrl":null,"permalink":"/categories/math/","section":"Categories","summary":"","title":"Math","type":"categories"},{"content":" Normal distributions are a class of continuous probability distributions for random variables. They often show up in measurements of physical quantities. For example, the human height is normally distributed. In fact, it is claimed that variables in natural and social sciences are normally or approximately normally distributed. Weight, reading ability, test scores, blood pressure, \\(\\ldots\\)\nThe underlying reason for this is partly due to the central limit theorem. The theorem says that the average of many observations of a random variable with finite average and variance is also a random variable whose distribution converges to a normal distribution as the number of observations increases. This conclusion has a very important implication for Monte Carlo methods, which I will discuss in another post.\nIn the simplest case (zero mean and unit variance), the distribution is written as \\begin{equation} f(x) = \\frac{1}{\\sqrt{2\\pi}}e^{-\\frac{x^2}{2}}. 
\\nonumber \\end{equation} If we plot the function \\(f(x)\\), the curve looks like the following\nThe prefactor \\(1/\\sqrt{2\\pi}\\) of \\(f(x)\\) is its normalization constant, which guarantees that the integration of \\(f(x)\\) over the entire real axis is unity \\begin{equation} \\int_{-\\infty}^\\infty f(x)dx = 1. \\nonumber \\end{equation} We all know that the mathematical constant \\(\\pi\\) is the ratio of a circle\u0026rsquo;s circumference to its diameter. This brings up an interesting question: why does \\(\\pi\\) show up in probability distributions that seemingly have nothing to do with circles?\nThe reason, it turns out, boils down to how the normalization of a normal distribution is calculated. A normal distribution belongs to a type of function called the Gaussian function \\(g(x) = e^{-x^2}\\). Its integral is \\begin{equation} \\int_{-\\infty}^\\infty e^{-x^2}dx = \\sqrt{\\pi}.\\nonumber \\end{equation}\nStandard approach # A textbook approach for getting the normalization constant is to compute the square of the Gaussian integral \\begin{equation} \\left(\\int_{-\\infty}^\\infty e^{-x^2}dx\\right)^2 = \\int_{-\\infty}^\\infty\\int_{-\\infty}^\\infty e^{-(x^2+y^2)}dxdy. \\nonumber \\end{equation} In this way, we can leverage the coordinate transformation to map the Cartesian coordinates \\((x,y)\\) into polar coordinates \\((r,\\theta)\\) using the identity \\begin{equation} x^2+y^2 = r^2, \\nonumber \\end{equation} which is exactly an equation of a circle with radius \\(r\\). It is this equation (transformation) that allows \\(\\pi\\) to have a place in the normal distribution. To continue, the square of the Gaussian integral now becomes\n\\begin{align} \\int_{-\\infty}^\\infty\\int_{-\\infty}^\\infty e^{-(x^2+y^2)}dxdy \u0026amp;= \\int_0^{2\\pi}\\int_0^\\infty e^{-r^2} \\,r drd\\theta \\nonumber \\\\\\ \u0026amp;= 2\\pi\\int_0^\\infty \\frac{1}{2} e^{-y} dy,\\,\\,\\,\\,\\,\\,\\text{where}\\,\\,\\, y=r^2 \\nonumber\\\\\\ \u0026amp;= \\pi. \\nonumber \\end{align}\nAs a result, the normalization of the Gaussian function is \\(\\sqrt{\\pi}\\).\nContour integration # Of course, the method described above is not the only way to normalize the Gaussian integral. There are many proofs ranging from differentiation of an integral, volume integration, the \\(\\Gamma\\)-function, asymptotic approximation, Stirling\u0026rsquo;s formula, and Laplace\u0026rsquo;s original proof, to the residue theorem.\nSpeaking of the residue theorem, it is a powerful tool for evaluating integrals. The theorem generalizes Cauchy\u0026rsquo;s theorem and says that the integral of an analytic function \\(f(z)\\) around a closed contour \\(C\\) depends only on the properties of a few special points (singularities, or poles) inside the contour.\n\\begin{equation} \\int_C f(z) dz = 2\\pi i\\sum_{j=1}^n R_j, \\nonumber \\end{equation}\nwhere \\(R_j\\), one of those special points, is called the residue at the point \\(z_j\\):\n\\begin{equation} R_j = \\frac{1}{2\\pi i}\\oint_{C_j} f(z) dz. \\nonumber \\end{equation}\nSo the residue theorem by definition has a prefactor that contains \\(\\pi\\). However, if we want to apply the theorem to the Gaussian function, we would hit a wall. This is because the complex Gaussian function \\(g(z) = e^{-z^2}\\) has no singularities on the entire complex plane. This property has bothered many, and in 1914 the mathematician G. N. 
Watson said in his textbook Complex Integration and Cauchy\u0026rsquo;s Theorem that\nCauchy’s theorem cannot be employed to evaluate all definite integrals; thus \\(\\displaystyle\\int_0^\\infty e^{-x^2}dx\\) has not been evaluated except by other methods.\nFinally, in the 1940s, several proofs were published using the residue theorem. However, these proofs are based on awkward contours and analytic functions that seem to come out of nowhere. For example, Kneser used the following definition 1 $$ \\frac{e^{-z^2/2}}{1-e^{-\\sqrt{\\pi}(1+i)z}}. $$ Basically, the idea is to construct an analytic function that looks like the Gaussian but with poles so that the residue theorem can be applied.\nWhile I like the beauty and power of the residue theorem, the complex analysis proof just seems too complex and artificial for my taste. The standard textbook proof is simple and incorporates naturally the reason why $\\pi$ is there in the normalization. Anyway, I found this little fact a bit interesting.\nH. Kneser, Funktionentheorie, Vandenhoeck and Ruprecht, 1958.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"25 December 2021","externalUrl":null,"permalink":"/posts/2021/gaussial-integral/","section":"Posts","summary":"\u003cp\u003e\n\nNormal distributions are a class of continuous probability distributions for random variables.\nThey often show up in measurements of physical quantities. For example,\n\u003ca href=\"https://ourworldindata.org/human-height#height-is-normally-distributed\" target=\"_blank\"\u003e\nthe human height is normally distributed\u003c/a\u003e. In fact, it is claimed that variables in natural\nand social sciences are normally or approximately normally distributed. Weight, reading\nability, test scores, blood pressure, \\(\\ldots\\)\u003c/p\u003e","title":"Normal distributions and \\\\(\\pi\\\\)","type":"posts"},{"content":"Revisited the Orion Nebula M42. This time with the mighty (compared to the WO Z61 MK2) Takahashi TSA-120N. M42 is a very bright nebula so it is fairly straightforward to image overall. The trickiest part is probably the Trapezium region where care must be taken in order to prevent overexposure. I managed to get a few hours of LRGB data before the clouds kicked in. Applied a PixInsight HDR trick I learned from the internet to process the Trapezium cluster region.\nIntegrated image # Telescope: Takahashi TSA-120N Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: ZWO LRGB 31mm Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 2h 30' Check out the photo on my Astrobin\n","date":"5 November 2021","externalUrl":null,"permalink":"/posts/2021/11-05-orion/","section":"Posts","summary":"\u003cp\u003eRevisited the Orion Nebula M42. This time with the mighty (compared to the WO Z61 MK2) Takahashi TSA-120N.\nM42 is a very bright nebula so it is fairly straightforward to image overall. The trickiest part is probably\nthe Trapezium region where care must be taken in order to prevent overexposure. I managed to get a few hours\nof LRGB data before the clouds kicked in. Applied a PixInsight HDR trick I learned from the internet to\nprocess the Trapezium cluster region.\u003c/p\u003e","title":"Orion Nebula","type":"posts"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"}] \ No newline at end of file +[{"content":" Welcome # Welcome. 
I use this blog to collect notes and some random stuff I do. The nickname fermion comes from a concept in quantum many-body physics. It refers to a class of microscopic particles that are unique in the sense that no two fermions can have the same quantum numbers. To some level, we as humans are also like fermions: each of us (or our state, to be more precise) is unique. Hence the nickname.\nWhat I do for a living # I\u0026rsquo;m currently a Research Data Specialist. Before that, I was a Data Scientist working on retail demand forecast and price optimization. Before that, I was a theoretical condensed matter physicist using quantum Monte Carlo methods to study properties of strongly correlated electron systems.\nThings I like to do when I\u0026rsquo;m not working # Jogging and badminton. Because my job requires a lot of screen time, I try to stay away from my computer when I am not at work. Jogging keeps me in shape, and badminton is a racket sport I have really liked since high school.\nFlute playing. I\u0026rsquo;m a flute enthusiast. I have been taking lessons since 2014. My favorite flutists include Peter-Lukas Graf, Emmanuel Pahud, Patrick Gallois, and Denis Bouriakov. In terms of flute music, I like pieces written by composers in the Baroque and Classical eras. J. S. Bach\u0026rsquo;s flute sonatas and Mozart\u0026rsquo;s flute quartets are my all-time favorites. Modern French composers like Debussy, Faure, and Gaubert also have some very sweet flute music.\nAstrophotography. I started doing astrophotography recently. Back when I was still a student, I was very interested in astrophotography but could not afford to do it because equipment such as the mount and telescope was expensive (it still is today). It was not until a few years ago that I was able to get into this hobby. The seeing in my neighborhood is pretty decent, so I am able to snap pretty nice pictures in my backyard. Here is my Astrobin page.\n","date":"20 August 2017","externalUrl":null,"permalink":"/about/","section":"Walking in the woods...","summary":"Welcome # Welcome.","title":"About","type":"page"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/categories/forecast/","section":"Categories","summary":"","title":"Forecast","type":"categories"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/tags/forecasting/","section":"Tags","summary":"","title":"Forecasting","type":"tags"},{"content":" Notes on Chapter 2 of Forecasting: Principles \u0026amp; Practice\nTime series patterns # Trend: Trend exists when there is a long-term increase or decrease in the data. The meaning of long or short is relative and depends on the data\u0026rsquo;s time scale. The trend does not need to be monotonic. It might go from an increasing trend to a decreasing one and vice versa. So long as the time scale of the tendency is larger than the data\u0026rsquo;s time scale, a trend can be identified.\nFor example, Fig. 1 shows the weekly passenger load of Ansett Airlines\u0026rsquo; economy class between 1986 and 1993. The data clearly indicates that there is an increasing trend, though bumpy, in the years 1986, 1990, and 1991 that lasted over several months.\nFig. 1 Weekly economy passenger load on Ansett Airlines. Credit: Forecasting: Principles and Practice Fig. 2 summarizes the monthly sales of antidiabetic drugs in Australia between 1992 and 2008. 
Clearly, this example demonstrates that sales are trending upward in this time period.\nFig. 2 Monthly sales of antidiabetic drugs in Australia. Credit: Forecasting: Principles and Practice Seasonal: When a time series is affected by seasonal factors such as the time of the year or the day of the week, we say a seasonal pattern exists. Seasonality is always of a fixed and known period. For example, the drug sales data shown in Fig. 2 shows a strong yearly seasonality, and the pattern persists.\nCyclic: A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency. The emphasis on the non-fixed frequency makes it clear that cyclic patterns and seasonality are different concepts. The latter, as Fig. 2 illustrates, has a clear and well-defined period/frequency. The textbook also mentions that in general the length of a cycle is longer than the length of seasonal patterns. For example, Fig. 3 shows the sunspot activity between 1920 and 2020. The sunspot number has a roughly 11-year cycle. This period is not tied to any seasonal factor such as daily, weekly, or yearly and is much longer. Fig. 3 International Sunspot number. Credit: SpaceWeatherLive.com Deciphering seasonality through visualization # Seasonal plots: If the time series data exhibit seasonality, the pattern can be visualized by seasonal plots. These plots are similar to the usual time series plots except that the data are plotted against the individual seasons, such as sub-daily, daily, weekly, monthly, yearly, \u0026hellip; etc.\nSeasonal subseries plots: This is a plot that emphasizes the seasonal patterns by showing the data in separate mini time series plots. This type of plot is useful in identifying changes within particular seasons.\nScatter plots # When studying the relationship between two time series (for example: electricity usage and temperature), it is useful to plot one series against the other using a scatter plot.\nTo further quantify the relationship, one can calculate the correlation coefficient to measure the strength of the linear relationship between the two time series (variables). The correlation coefficient is defined as $$ r = \\frac{\\sum (x_t - \\bar{x})(y_t - \\bar{y})}{\\sqrt{\\sum(x_t - \\bar{x})^2} \\sqrt{\\sum(y_t - \\bar{y})^2}}. $$ By definition, \\(-1 \\leq r \\leq 1\\). Note that \\(r\\) only measures the strength of linear correlation between two variables. It cannot quantify higher order or more complex correlations. Therefore, one cannot rely solely on the correlation coefficient when looking at the relationship between variables. A small NumPy check of this formula follows these notes.\n","date":"30 June 2024","externalUrl":null,"permalink":"/posts/2024/06-30-forecasting-2/","section":"Posts","summary":"Notes on Chapter 2 of Forecasting: Principles \u0026amp; Practice","title":"Forecasting: Chap. 2","type":"posts"},
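As a quick check of the correlation formula in the notes above, here is a minimal NumPy sketch (my addition, not from the book; the two series are invented stand-ins for, say, temperature and electricity demand):

```python
import numpy as np

# Hypothetical paired observations, e.g. daily temperature and electricity demand
x = np.array([28.0, 30.5, 33.1, 35.0, 31.2, 29.4])
y = np.array([210.0, 226.0, 261.0, 284.0, 240.0, 218.0])

# r = sum((x-xbar)(y-ybar)) / (sqrt(sum((x-xbar)^2)) * sqrt(sum((y-ybar)^2)))
r = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)
print(r)                        # the formula applied directly
print(np.corrcoef(x, y)[0, 1])  # same value from NumPy's built-in
```

Both prints agree, and per the note above a value near 1 only says the linear relationship is strong; it says nothing about nonlinear structure.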
2","type":"posts"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/categories/study-notes/","section":"Categories","summary":"","title":"Study Notes","type":"categories"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"","date":"30 June 2024","externalUrl":null,"permalink":"/","section":"Walking in the woods...","summary":"","title":"Walking in the woods...","type":"page"},{"content":" Notes on Chapter 1 of Forecasting: Principles \u0026amp; Practice\nWhat can be forecast # Factors that affect the predictability of an event or a quantity\nHow well we understand the factors that contribute to it; How much data is available; How similar the future is to the past; Whether the forecasts can affect the thing we are trying to forecast. Example 1: Short-term forecasts of residential electricity demand\nTemperature is a primary driving force of the demand, especially in summer. Historical electricity usage data is available. It\u0026rsquo;s safe to assume that the demand behavior will be similar to that in the past. I.e., there is some degree of seasonality. Price of the electricity is not strongly dependent on the demand. So the forecast will not have too much effect on consumer behavior. Example 2: Currency exchange rate\nWe have very limited knowledge about what really contributes to exchange rates. There is indeed a lot of historical exchange rate data available. Very difficult to say that the future will be similar to the past. The market is very sensitive to a number of unpredictable factors such as political situation, a country\u0026rsquo;s financial stability and economy policies, \u0026hellip; etc. The exchange rate is bound to have strong correlation to the forecast outcome, as the market will response to any forecast results. This is called the efficient market hypothesis. Based on the predictability criterion, the exchange rate is likely not predictable. In fact, things like stock price and lottery number fall in this category.\nDetermine what to forecast # It is necessary to consider the forecasting horizon. Will it be one month in advance, 6 months, or for multiple years. Depending on the forecast horizon, different types of models will be necessary.\nForecasting data and methods # Qualitative forecasting: If there are no data available, or the data are not relevant to the forecast. Quantitative forecasting can be applied when: Numerical information about the past is available. It is reasonable to assume that some aspects of the past patterns will continue into the future. Forecasting models # Explanatory model: In this scenario, the historical behavior of a time series is assumed to be captured by the so-called predictor variables. For example, the hourly electricity demand \\(d\\) of a hot region during summer can be modeled by $$ d = F(\\text{temperature}, \\text{population}, \\text{time of day}, \\text{day of week}, \\text{error} ). $$\nThe relationship is not exact, but these variables are primary factors that are likely to impact the electricity demand. This type of model explains what causes the variation in electricity demand.\nTime series model: Electricity demand data form a time series. Hence a time series model can be used for forecasting. 
{"content":"","date":"24 November 2022","externalUrl":null,"permalink":"/categories/astrophotography/","section":"Categories","summary":"","title":"Astrophotography","type":"categories"},{"content":"","date":"24 November 2022","externalUrl":null,"permalink":"/tags/nebula/","section":"Tags","summary":"","title":"Nebula","type":"tags"},{"content":"Westerhout 5, known by most as the Soul Nebula, is an emission nebula located in Cassiopeia. This is a large star forming region like the Orion Nebula. The dark hollow areas embedded in the blue regions are cavities that were carved out by radiation and winds from the region\u0026rsquo;s massive stars.\nIntegrated image # Telescope: Takahashi FSQ-85EDX Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: Astronomik H-alpha 6nm 1.25\u0026quot;, SII 6nm 1.25\u0026quot;, OIII filters 6nm 1.25\u0026quot; Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 28h 25' Check out the photo on my Astrobin\n","date":"24 November 2022","externalUrl":null,"permalink":"/posts/2022/11-24-soul-nebula/","section":"Posts","summary":"\u003cp\u003eWesterhout 5, known by most as the Soul Nebula, is an emission nebula located in Cassiopeia.\nThis is a large star forming region like the Orion Nebula. The dark hollow areas embedded\nin the blue regions are cavities that were carved out by radiation and winds from the region\u0026rsquo;s\nmassive stars.\u003c/p\u003e","title":"Soul Nebula","type":"posts"},{"content":"The Veil Nebula. This is a 2x3 mosaic narrow-band imaging project that took me almost two months (9/4/2022 ~ 11/1/2022) to complete. A total of 2155 subs with 179 hours of accumulated exposure time. This nebula is a magnificent supernova remnant in the area of Cygnus, one of my favorite constellations.\nIntegrated image # Telescope: Takahashi FSQ-85EDX Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: Astronomik H-alpha 6nm 1.25\u0026quot;, SII 6nm 1.25\u0026quot;, OIII filters 6nm 1.25\u0026quot; Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 179h 35' Check out the photo on my Astrobin\n","date":"4 November 2022","externalUrl":null,"permalink":"/posts/2022/11-04-veil/","section":"Posts","summary":"\u003cp\u003eThe Veil Nebula. This is a 2x3 mosaic narrow-band imaging project that took me almost two months (9/4/2022 ~ 11/1/2022)\nto complete. A total of 2155 subs with 179 hours of accumulated exposure time. This nebula is a magnificent supernova\nremnant in the area of Cygnus, one of my favorite constellations.\u003c/p\u003e","title":"Veil Nebula","type":"posts"},{"content":"SH2-129 Flying Bat (red) and OU-4 Squid Nebulae. The combo can be found in the constellation Cepheus. Both belong to the family of emission nebulae. 
The Flying Bat consists of mostly ionized hydrogen gas while the Squid is a region full of ionized oxygen clouds. Due to its extreme dimness, the Squid was discovered only recently, in 2011, by French astrophotographer Nicolas Outters. The formation of the Squid is still under debate. But many astronomers believe the bright star HD202214 (near the center of the photo) plays a major role.\nTwo years into the hobby, I decided to take on the challenge. And, not surprisingly, this is the most demanding project by far. The total integration time is 72 hours. I knew the squid is an extremely faint emission nebula before I started. However, it\u0026rsquo;s more difficult than I expected to get enough photons even in a single 10 min sub for a f/5.3 scope and under Bortle 5/6 skies. Luckily, the weather in Northern California was almost perfect in the past 20 days or so during the project. Cloud coverage was constantly around or below 6-8%, the worst being 12-15%.\nApart from a guiding issue I still need to sort out, imaging the objects was pretty smooth. Processing the data, however, is another story. For my setup, the signal for the squid is not much above the background noise level. Stretching the squid without blowing up the noise was really pushing my processing skills to the limit. I ended up using a mask in order to suppress the noise in the nonlinear stage. All in all, I\u0026rsquo;ve learned a lot from the project. Hope you will like the results.\nIntegrated image # Telescope: Takahashi FSQ-85EDX Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: Astronomik Deep-Sky RGB 31mm + Astronomik H-alpha 6nm 1.25\u0026quot; + OIII 6nm 1.25\u0026quot; Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 71h 50' Check out the photo on my Astrobin\n","date":"4 July 2022","externalUrl":null,"permalink":"/posts/2022/07-04-squid/","section":"Posts","summary":"\u003cp\u003eSH2-129 Flying Bat (red) and OU-4 Squid Nebulae. The combo can be found in the constellation Cepheus.\nBoth belong to the family of emission nebulae. The Flying Bat consists of mostly ionized hydrogen gas while\nthe Squid is a region full of ionized oxygen clouds. Due to its extreme dimness, the Squid was\ndiscovered only recently, in 2011, by French astrophotographer Nicolas Outters. The formation of\nthe Squid is still under debate. But many astronomers believe the bright star HD202214 (near the center\nof the photo) plays a major role.\u003c/p\u003e","title":"Squid Nebula","type":"posts"},{"content":"M13, the Great Globular Cluster. Located in the constellation of Hercules, it is roughly 27,100 light years away from Earth. The cluster is the brightest globular cluster in the northern hemisphere, visible to the naked eye under dark skies. My first astrophotography image taken with the Takahashi FSQ-85EDX, also called the Baby-Q.\nIntegrated image # Telescope: Takahashi FSQ-85EDX Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: ZWO LRGB 31mm Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 3h 40' Check out the photo on my Astrobin\n","date":"5 June 2022","externalUrl":null,"permalink":"/posts/2022/06-05-m13/","section":"Posts","summary":"\u003cp\u003eM13, the Great Globular Cluster. Located in the constellation of Hercules, it is roughly 27,100 light\nyears away from Earth. The cluster is the brightest globular cluster in the northern hemisphere, visible\nto the naked eye under dark skies. 
My first astrophotography image taken with the Takahashi FSQ-85EDX,\nalso called the \u003cstrong\u003eBaby-Q\u003c/strong\u003e.\u003c/p\u003e","title":"M13 Cluster","type":"posts"},{"content":"","date":"5 June 2022","externalUrl":null,"permalink":"/tags/star-cluster/","section":"Tags","summary":"","title":"Star Cluster","type":"tags"},{"content":"Elephant\u0026rsquo;s Trunk and the IC1396. Standard SHO narrow band image. Elephant\u0026rsquo;s trunk (the dark cloud on the left) is an area of interstellar gas and dust in the large ionized gas region IC1396. The area is located in the constellation Cepheus, about 2400 light years away from Earth.\nThe entire area is very dynamic. The IC1396 is being illuminated and ionized by a bright and massive star HD 206267 seen near the center of the frame. The Elephant\u0026rsquo;s Trunk region is currently thought to be a star formation site containing many young stars (less than 100,000 years old).\nIntegrated image # Telescope: Takahashi FSQ-85EDX Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: Astronomik H-alpha 6nm 1.25\u0026quot; + SII 6nm 1.25\u0026quot; + OIII 6nm 1.25\u0026quot; Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 15h 35' Check out the photo on my Astrobin\n","date":"30 May 2022","externalUrl":null,"permalink":"/posts/2022/05-30-elephans-trunk/","section":"Posts","summary":"\u003cp\u003eElephant\u0026rsquo;s Trunk and the IC1396. Standard SHO narrow band image. Elephant\u0026rsquo;s trunk (the dark cloud on the left) is an area of interstellar gas and dust in the large ionized gas region IC1396. The area is located in the constellation Cepheus, about 2400 light years away from Earth.\u003c/p\u003e\n\u003cp\u003eThe entire area is very dynamic. The IC1396 is being illuminated and ionized by a bright and massive star HD 206267 seen near the center of the frame. The Elephant\u0026rsquo;s Trunk region is currently thought to be a star formation site containing many young stars (less than 100,000 years old).\u003c/p\u003e","title":"IC1396 Elephant's Trunk Nebula","type":"posts"},{"content":"","date":"9 April 2022","externalUrl":null,"permalink":"/tags/data-science/","section":"Tags","summary":"","title":"Data Science","type":"tags"},{"content":"","date":"9 April 2022","externalUrl":null,"permalink":"/tags/mlops/","section":"Tags","summary":"","title":"MLops","type":"tags"},{"content":"Quick notes of AWS MLops, Week 4\nTalk about availability, scalability, and resilience # Monitoring and Logging\nViewpoint: data science for software systems AWS CloudWatch Dashboards Search Alerts Automated insights Use CloudWatch to pull in info from servers hosting source code and monitoring agents (e.g. CPU metrics, Memory metrics, and DISK I/O metrics) Multiple Regions\nResources are distributed across isolated geographic regions which have multiple availability zones Create as many redundant infrastructures as needed Increase resilience Reproducible Workflows\nInfrastructure as code (IAC) workflow: the idea behind infrastructure as code is that there isn\u0026rsquo;t a human pushing buttons to make something happen. The build is triggered by events. Implement Appropriate ML Services # Comparisons between higher infrastructure control and faster application development and deployment Provisioning EC2\nCan launch EC2 from console, SDK, or CLI. Sub-components: User data: could put special instructions here Storage: EBS versus Instance Security Group: firewall rules for the EC2 launch SSH Key pair Have an Amazon Machine Image? Instance type: CPU vs. GPU Cost: On demand vs. Spot Virtual Private Cloud (VPC) IAM Role (a boto3 sketch of these pieces follows) 
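A minimal boto3 sketch of the sub-components listed above (AMI, instance type, key pair, security group, user data); every identifier here is a hypothetical placeholder, not a working resource:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one on-demand instance; all IDs below are made-up placeholders
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # Amazon Machine Image
    InstanceType="t3.micro",                    # CPU; a g4dn.* type for GPU training
    KeyName="my-ssh-key",                       # SSH key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],  # firewall rules
    UserData="#!/bin/bash\necho 'special instructions here'",
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```

Spot instances would go through the InstanceMarketOptions parameter of the same call rather than a different API.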
Provisioning EBS, Elastic Beanstalk, various possibilities of building on top of the AWS platform\nKey idea: Elastic Beanstalk can scale resources up and down automatically according to the health metrics from the load balancer. The provisioning model is elastic. Block storage can be provisioned to have high bandwidth The user decides which parts should be pre-provisioned and which parts should be elastic. Example: You need extremely high-bandwidth storage for machine learning training where a cluster of machines all talk to the same mount point. AWS ML Services\nMany high-level ML services provided Provides GUI and console access, plus programmatic access through boto3 Examples and tutorials; the boto3 documentation has all API call details Deploying and Securing ML Solutions # Principle of Least Privilege AWS Lambda\nConfigure the Lambda microservice with the minimal privileges necessary for upstream (e.g. AWS S3) and downstream (e.g. DynamoDB) access. Can be achieved through IAM role-based policies (a sketch of such a policy follows) 
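As a sketch of what least privilege through IAM role-based policies can look like with boto3 (the role, bucket, and table names are hypothetical, and the role is assumed to exist already):

```python
import json
import boto3

iam = boto3.client("iam")

# Scope the Lambda role to one S3 prefix upstream and one DynamoDB table downstream
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::my-input-bucket/uploads/*"},
        {"Effect": "Allow",
         "Action": ["dynamodb:PutItem"],
         "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/my-results-table"},
    ],
}

iam.put_role_policy(
    RoleName="my-lambda-role",
    PolicyName="least-privilege-s3-dynamodb",
    PolicyDocument=json.dumps(policy),
)
```

Anything not explicitly allowed is denied, which is exactly the principle of least privilege stated above.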
Integrated Security\nAWS security firewall, blocking incoming ports via role-based privileges Within the firewall, data transfer is encrypted between source and AWS S3 object storage Everything is inside a virtual private cloud (AWS VPC) Use automated deployment or infrastructure as code, so there is no need to worry about manual mistakes Audit: AWS CloudTrail monitors API calls and all actions that are occurring in the network Use AWS SageMaker Studio to prepare data, build model, train \u0026amp; tune model, and deploy. This platform provides launchers that have many models and templates for jump-starting any project\nAWS SageMaker Canvas: Using Canvas is a great way to understand, at a high level, the kinds of machine learning problems being solved; you can also build your own machine learning system with SageMaker, or build your own system outside it using AWS Cloud9.\nData Drift and Model Monitoring:\nUse data to train the first model New data comes in, triggers a data drift alert saying a new model is needed New data is combined with the old one and a new model is trained, registered, and deployed ","date":"9 April 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week4/","section":"Posts","summary":"\u003cp\u003eQuick notes of AWS MLops, Week 4\u003c/p\u003e","title":"Study notes: MLops Week 4","type":"posts"},{"content":"Week 3-3 of the AWS MLops: Computer vision and AWS Rekognition\nComputer Vision and Amazon Rekognition # Computer vision # Automated extraction of information from digital images Applications Public safety and home security Authentication and enhanced computer-human interaction Content management and analysis Autonomous driving Medical imaging Manufacturing process control Computer vision problems: Image analysis Object classification Object detection Object segmentation Video analysis Instance tracking, pathing Action recognition Motion estimation Amazon Rekognition # Managed service for image and video analysis\nTypes of analysis:\nSearchable image and video libraries Face-based user verification Sentiment and demographic analysis Unsafe content detection Can add powerful visual analysis to applications\nHighly scalable and continuously learns\nIntegrates with other AWS services\nExamples:\nSearchable image library Image moderation Sentiment analysis AWS services used in these examples: S3 Lambda Rekognition Elasticsearch Service Kinesis Video Streams Kinesis Data Streams Redshift QuickSight Custom Labels # Example use cases Search logos Identify products Identify machine parts Distinguish between healthy and infected plants Almost all vision solutions start with an existing model Custom labeling process Collect images Collect a few hundred images Build domain-specific models 10 PNG or JPEG images per label Use images similar to the images that you want to detect Create dataset Images, labels, and bounding boxes Need at least two labels Label images by using the console or Amazon SageMaker Ground Truth Model evaluation Precision, recall Overall model performance Improve the model Better and more data Reduce false positives (better precision): could add more classes as labels for training Reduce false negatives (better recall): use better data or more precise classes (labels) for training Adjust the confidence threshold to tune precision/recall Use model Apply the model on new images and collect custom labels: label, object bounding box, and confidence level (a boto3 sketch of this step follows these notes) ","date":"3 April 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week33-computer-vision/","section":"Posts","summary":"\u003cp\u003eWeek 3-3 of the AWS MLops: Computer vision and AWS Rekognition\u003c/p\u003e","title":"Study notes: MLops Week 3-3 Computer Vision","type":"posts"},
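For the "use model" step at the end of the custom-labels notes, a hedged boto3 sketch; the project version ARN, bucket, and file name are invented placeholders:

```python
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# Apply a trained custom-labels model to a new image stored in S3
response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:123456789012:"
                      "project/plants/version/v1/1234567890123",
    Image={"S3Object": {"Bucket": "my-image-bucket", "Name": "leaf-001.jpg"}},
    MinConfidence=80,  # raising this trades recall for precision
)
for label in response["CustomLabels"]:
    print(label["Name"], label["Confidence"], label.get("Geometry"))
```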
{"content":"Week 3-2 of the AWS MLops: AWS Forecast and time-series data\nForecasting and AWS Forecast # Overview # Predicting future values that are based on historical data Patterns include Trends Seasonal, pattern that is based on seasons Cyclical, other repeating patterns Irregular, patterns that might appear to be random Examples Sales and demand forecast Energy consumption Inventory projections Weather forecast Processing time series data # Time series data is captured in sequence over time\nHandle missing data\nForward fill Backward fill Moving average Interpolation: linear, spline, or polynomial Sometimes zero is a good fill value Resampling: Resampling time series data allows the flexibility of defining the resolution of the data\nUpsampling: increase the sample frequency, e.g. from minutes to seconds. Care must be taken in deciding how the fine-grained samples are computed. Downsampling: decrease the sample frequency, e.g. from days to months. Need to pay attention to how the aggregation is carried out. Reasons for resampling: Inspect the behavior of data under different resolutions Join tables with different resolutions Sample smoothing, including outlier removal\nWhy Part of the data preparation process For visualization How does smoothing affect the outcome Cleaner data to model Model compatibility Production improvement? Seasonality\nHourly, daily, quarterly, yearly Spring, summer, fall, winter Holidays Time series sample correlations\nStationary How stable is the system Does the past inform the future Trends Correlation issues Autocorrelation How points in a time series sample are linearly related pandas offers many methods for handling time series data (a short pandas sketch follows these notes)\nTime-aware index groupby and resample() autocorr() method Time series algorithms offered by Amazon Forecast\nARIMA, autoregressive integrated moving average DeepAR+ Exponential Smoothing (ETS) Non-Parametric Time Series (NPTS) Prophet Model evaluation\nTime series data model training cannot use $k$-fold cross validation because the data is ordered and correlated. Standard approach: back testing Two metrics can be used to assess the backtesting (hindcasting instead of forecasting) accuracy wQuantileLoss: the average error for each quantile in a set RMSE, root mean square error ","date":"28 March 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week32-forecasting/","section":"Posts","summary":"\u003cp\u003eWeek 3-2 of the AWS MLops: AWS Forecast and time-series data\u003c/p\u003e","title":"Study notes: MLops Week 3-2 Forecasting","type":"posts"},
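A small pandas sketch of the missing-data, resampling, and autocorrelation ideas above (the daily series is invented; exact frequency aliases vary slightly across pandas versions):

```python
import pandas as pd

# Hypothetical daily demand with one missing value
idx = pd.date_range("2022-03-01", periods=6, freq="D")
demand = pd.Series([10.0, 12.0, None, 13.0, 15.0, 14.0], index=idx)

filled = demand.ffill()                  # forward fill the gap
monthly = filled.resample("M").mean()    # downsample: days -> months, mean aggregation
hourly = filled.resample("6H").interpolate("linear")  # upsample with interpolation
print(monthly)
print(hourly.head())
print(filled.autocorr(lag=1))            # lag-1 autocorrelation
```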
{"content":"Week 3-1 of the AWS MLops: ML steps, pipeline, and AWS SageMaker\nMachine Learning Pipeline and AWS SageMaker # Forming the business problem # Define the business objective, questions to ask:\nHow is this task done today? How will the business measure success? How will the solution be used? Do similar solutions exist? What are the assumptions? Who are the domain experts? Frame the business problem\nIs it a machine learning problem? What kind of ML problem is it? Classification or regression? Is the problem supervised or unsupervised? What is the target to predict? Do you have access to the relevant data? What is the minimum or baseline performance? Would you solve the problem manually? What is the simplest solution? Can the success be measured? Collect and secure data, ETL # Data sources\nPrivate data Commercial data: AWS Data Exchange, AWS Marketplace, $\\ldots$ Open-source data Kaggle World Health Organization US Census Bureau National Oceanic and Atmospheric Administration UC Irvine Machine Learning repository AWS Critical question: Is your data representative? ETL with AWS Glue\nRuns the ETL process Crawls data sources to create catalogs that can be queried ML functionality AWS Glue can glue together different datasets and emit a single endpoint that can be queried Data security: Access control and Data encryption\nControl access using AWS Identity and Access Management (IAM) policy AWS S3 default encryption AWS RDS encryption AWS CloudTrail: tracks user activity and application programming interface (API) usage Data evaluation # Make sure the data is in the correct format Use descriptive statistics to gain insights into the dataset before cleaning the data Overall statistics Categorical statistics can identify frequency of values and class imbalances Multivariate statistics Scatter plot to inspect the correlation between two variables pandas provides the scatter_matrix method to examine multivariate correlations Correlation matrix and heat map Attribute statistics Feature engineering # Feature extraction\nData encoding\nCategorical data must be converted to a numeric scale If data is non-ordinal, the encoded value must be non-ordinal, which might need to be broken into multiple categories Data cleaning\nVariations in strings: text standardization Variations in scale: scale normalization Columns with multiple data items: parse into multiple columns Missing data: Causes of missing data: undefined values data collection errors data cleaning errors Plan for missing data: ask the following questions first What were the mechanisms causing the missing data? Are values missing at random? Are rows or columns missing that you are not aware of? Standard approaches Dropping missing data Imputing missing data Outliers Finding the outliers: box plots or scatter plots for visualization Dealing with outliers: Delete - e.g. outliers were created by artificial errors Transform - reduce the variation Impute - e.g. use the mean for the outliers Feature selection\nFilter method\nPearson\u0026rsquo;s correlation Linear discriminant analysis (LDA) Analysis of variance (ANOVA) Chi-square $\\chi^2$ analysis Wrapper method\nForward selection Backward selection Embedded method\nDecision trees LASSO and RIDGE Model training and evaluation # Holdout method\nSplit the data into training and test sets The model is trained on the training set. Afterwards, its performance is evaluated by testing the model on the test set data which the model has never touched. Advantage: straightforward to implement and computationally cheap because training and testing are carried out once each. Disadvantage: It could happen that the test set and the training set have different statistical distributions, i.e. the test set data cannot faithfully represent the training set distribution. In this case, the validation result is likely not accurate. If we tune the model based on a single test set, we may end up overfitting the test data set. While this approach can be improved by using training, validation, and test sets, the result might still depend on the way the data sets are prepared, leading to some degree of bias. $k$-fold cross-validation method, an evaluation method that minimizes the disadvantages of the holdout method.\nDivide the whole data set into training and test set. Shuffle the training set randomly, if possible.1 Split the training set into $k$ non-overlapping subsets (folds) that are equally partitioned, if possible. For each of the $k$ folds: Train a new model on the $k-1$ folds and validate using the remaining fold. Retain the evaluation score and discard the model. The performance metric is obtained by averaging the $k$ evaluation scores. The test set is used for final evaluation. To avoid data leakage, any feature engineering should be carried out separately for training and validation inside the CV loop. Reference for practical advice on cross-validation, including imbalanced data sets A scikit-learn sketch of this procedure follows these notes.
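A scikit-learn sketch of the procedure above, including the point about doing feature engineering inside the CV loop (synthetic data; logistic regression is just a stand-in model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Hold out the final test set first; cross-validation touches only the training part
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scores = []
for fit_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_tr):
    # Fit the scaler on the k-1 training folds only, to avoid data leakage
    scaler = StandardScaler().fit(X_tr[fit_idx])
    model = LogisticRegression().fit(scaler.transform(X_tr[fit_idx]), y_tr[fit_idx])
    scores.append(model.score(scaler.transform(X_tr[val_idx]), y_tr[val_idx]))

print(np.mean(scores))  # the average of the k evaluation scores
```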
Evaluation\nClassification problems Confusion matrix F1 score, the harmonic mean of precision and sensitivity AUC-ROC Regression Mean squared error Model tuning # Amazon SageMaker offers automated hyperparameter tuning Best practices Don\u0026rsquo;t adjust every hyperparameter Limit the range of values to what\u0026rsquo;s most effective Run one training job at a time instead of multiple jobs in parallel In distributed training jobs, make sure that the objective metric that you want is the one that is reported back With Amazon SageMaker, convert log-scaled hyperparameters to linear-scaled when possible.\nTime-series data is ordered and can\u0026rsquo;t be shuffled.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"20 March 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week31-model-selection/","section":"Posts","summary":"\u003cp\u003eWeek 3-1 of the AWS MLops: ML steps, pipeline, and AWS SageMaker\u003c/p\u003e","title":"Study notes: MLops Week 3-1 Machine Learning pipeline","type":"posts"},{"content":"Week 2 of the AWS MLops: Data preparation using AWS services\nMachine Learning (ML) and AWS ML Services # Deep learning $\\subset$ Machine learning $\\subset$ Artificial intelligence\nML is the scientific study of algorithms and statistical models to perform a task using inference instead of instructions\nTypical workflow: Data → Model → Prediction\nTypes of ML algorithms and common business use cases\nSupervised learning: Fraud detection Image recognition - computer vision Customer retention Medical diagnostics - computer vision Personalized advertising Sales prediction Weather forecasting Market projection Population growth prediction $\\ldots$ Unsupervised learning: Product recommendations Customer segmentation Targeted marketing Medical diagnostics Natural language processing - chatbot, translation, sentiment analysis Data structure discovery Gene sequencing $\\ldots$ Reinforcement learning: best when the desired outcome is known but the exact path to achieving it is not known Game AI Self-driving cars Robotics Customer service routing $\\ldots$ Use ML when you have:\nLarge datasets, large number of variables Lack of clear procedures to obtain solutions Existing ML expertise Infrastructure already in place to support ML Management support for ML Typical ML workflow Iterative process Data processing Training Evaluation ML frameworks and infrastructure\nFrameworks provide tools and code libraries Customized scripting Integration with AWS services Community of developers Example: PyTorch, TensorFlow, scikit-learn, $\\ldots$ Infrastructure Designed for ML applications AWS IoT Greengrass provides an infrastructure for building ML for IoT devices AWS Elastic Inference reduces costs for running ML apps AWS ML managed services, no ML experience required\nComputer vision: Amazon Rekognition, Amazon Textract Speech: Amazon Polly, Amazon Transcribe Language: Amazon Comprehend, Amazon Translate Chatbots: Amazon Lex Forecasting: Amazon Forecast Recommendations: Amazon Personalize Three layers of the Amazon Machine Learning stack:\nManaged Services Machine Learning Services Machine Learning 
Frameworks ML challenges\nData Poor quality Non-representative Insufficient Overfitting and underfitting Business Complexity in formulating questions Explaining models to business stakeholders Cost of building systems Users Lack of data science expertise Cost of staffing with data scientists Lack of management support Technology Data privacy issues Tool selection can be complicated Integration with other systems Feature Engineering # Public datasets for feature engineering and model tuning\nHugging Face public datasets Kaggle public datasets Amazon S3 buckets A useful concept: combine old features to produce new features for training/validation\nHelpful to create an ML project structure so that the project can be managed and tracked phase by phase\nData ingest Exploratory data analysis (EDA) Modeling Conclusion At the EDA phase, typical approaches (a short pandas sketch follows these notes)\nLook at descriptive statistics Graphing data, examine trends: linear, logarithmic, $\\ldots$ Clustering data ","date":"14 March 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week2-aws-ml-data-preparation/","section":"Posts","summary":"\u003cp\u003eWeek 2 of the AWS MLops: Data preparation using AWS services\u003c/p\u003e","title":"Study notes: MLops Week 2 AWS ML Data Preparation","type":"posts"},
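A short pandas sketch of those EDA steps (the data frame is an invented stand-in for a public dataset):

```python
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical dataset; in practice, e.g. pd.read_csv("some-public-file.csv")
df = pd.DataFrame({
    "price": [3.2, 4.1, 5.0, 4.8, 6.3, 7.1],
    "demand": [120, 110, 95, 94, 80, 72],
    "temp": [21.0, 23.5, 25.1, 24.8, 28.0, 30.2],
})

print(df.describe())   # descriptive statistics
print(df.corr())       # correlation matrix across variables
scatter_matrix(df)     # pairwise scatter plots for trends (requires matplotlib)
```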
","date":"9 March 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week13-data-ingestion/","section":"Posts","summary":"\u003cp\u003eWeek 1-3 of the AWS MLops: Data ingestion and AWS jobs\u003c/p\u003e","title":"Study notes: MLops Week 1-3 Data Ingestion and Transformation","type":"posts"},{"content":"SH2-252 Monkey Head Nebula. Located in the Orion Constellation. Embedded in the nebula is an open cluster NGC 2175. The cluster is about 6350 light years from Earth. The nebula has active ionized oxygen, sulfur, and hydrogen clouds. A very good narrowband subject. Does the nebula look like a monkey head? I can\u0026rsquo;t really tell\u0026hellip;😂\nIntegrated image # Telescope: Takahashi TSA-120N Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: Astronomik H-alpha 6nm 1.25\u0026quot; + SII 6nm 1.25\u0026quot; + OIII 6nm 1.25\u0026quot; Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 9h Check out the photo on my Astrobin\n","date":"4 March 2022","externalUrl":null,"permalink":"/posts/2022/03-04-monkey/","section":"Posts","summary":"\u003cp\u003eSH2-252 Monkey Head Nebula. Located in the Orion Constellation. Embedded in the nebula is an open\ncluster NGC 2175. The cluster is about 6350 light years from Earth. The nebula has active ionized oxygen,\nsulfur, and hydrogen clouds. A very good narrowband subject. Does the nebula look like a monkey head?\nI can\u0026rsquo;t really tell\u0026hellip;😂\u003c/p\u003e","title":"NGC2175, Monkey Head Nebula","type":"posts"},{"content":"IC434 Horsehead Nebula. This nebula is probably one of the most imaged subject in astrophotography. IC434 is a dark nebula in the constellation Orion. It is about 1375 light years away from Earth. The nebula got its name because the dark dust cloud looks just like a horse\u0026rsquo;s head. Behind the Horsehead nebula is a huge area of ionized hydrogen gas lit up by the star group Sigma Orions (at the top of the image)\nIntegrated image # Telescope: Takahashi TSA-120N + TOA-35 Reducer Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: ZWO RGB 31mm + Astronomik H-alpha 6nm 1.25\u0026quot; Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 20h 45' Check out the photo on my Astrobin\n","date":"28 February 2022","externalUrl":null,"permalink":"/posts/2022/02-28-horsehead/","section":"Posts","summary":"\u003cp\u003eIC434 Horsehead Nebula. This nebula is probably one of the most imaged subject in astrophotography.\nIC434 is a dark nebula in the constellation Orion. It is about 1375 light years away from Earth.\nThe nebula got its name because the dark dust cloud looks just like a horse\u0026rsquo;s head. 
{"content":"SH2-252 Monkey Head Nebula. Located in the Orion Constellation. Embedded in the nebula is an open cluster NGC 2175. The cluster is about 6350 light years from Earth. The nebula has active ionized oxygen, sulfur, and hydrogen clouds. A very good narrowband subject. Does the nebula look like a monkey head? I can\u0026rsquo;t really tell\u0026hellip;😂\nIntegrated image # Telescope: Takahashi TSA-120N Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: Astronomik H-alpha 6nm 1.25\u0026quot; + SII 6nm 1.25\u0026quot; + OIII 6nm 1.25\u0026quot; Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 9h Check out the photo on my Astrobin\n","date":"4 March 2022","externalUrl":null,"permalink":"/posts/2022/03-04-monkey/","section":"Posts","summary":"\u003cp\u003eSH2-252 Monkey Head Nebula. Located in the Orion Constellation. Embedded in the nebula is an open\ncluster NGC 2175. The cluster is about 6350 light years from Earth. The nebula has active ionized oxygen,\nsulfur, and hydrogen clouds. A very good narrowband subject. Does the nebula look like a monkey head?\nI can\u0026rsquo;t really tell\u0026hellip;😂\u003c/p\u003e","title":"NGC2175, Monkey Head Nebula","type":"posts"},{"content":"IC434 Horsehead Nebula. This nebula is probably one of the most imaged subjects in astrophotography. IC434 is a dark nebula in the constellation Orion. It is about 1375 light years away from Earth. The nebula got its name because the dark dust cloud looks just like a horse\u0026rsquo;s head. Behind the Horsehead nebula is a huge area of ionized hydrogen gas lit up by the star group Sigma Orionis (at the top of the image)\nIntegrated image # Telescope: Takahashi TSA-120N + TOA-35 Reducer Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: ZWO RGB 31mm + Astronomik H-alpha 6nm 1.25\u0026quot; Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 20h 45' Check out the photo on my Astrobin\n","date":"28 February 2022","externalUrl":null,"permalink":"/posts/2022/02-28-horsehead/","section":"Posts","summary":"\u003cp\u003eIC434 Horsehead Nebula. This nebula is probably one of the most imaged subjects in astrophotography.\nIC434 is a dark nebula in the constellation Orion. It is about 1375 light years away from Earth.\nThe nebula got its name because the dark dust cloud looks just like a horse\u0026rsquo;s head. Behind the\nHorsehead nebula is a huge area of ionized hydrogen gas lit up by the star group Sigma Orionis\n(at the top of the image)\u003c/p\u003e","title":"Horsehead Nebula","type":"posts"},{"content":"Week 1-2 of the AWS MLops: Data repositories and AWS managed storage choices\nIntroduction to AWS S3 # Object storage service that offers scalability, data availability, security, and performance.\n99.999999999% (11 9\u0026rsquo;s) durability Easy to use management features Can respond to event triggers Use cases:\nContent storage/distribution Backup, restore, and archive Data lakes and big data analytics Disaster recovery Static website hosting Component:\nBucket: https://s3-\u0026lt;aws-region\u0026gt;.amazonaws.com/\u0026lt;bucket-name\u0026gt;/ Object: https://s3-\u0026lt;aws-region\u0026gt;.amazonaws.com/\u0026lt;bucket-name\u0026gt;/\u0026lt;object-key\u0026gt; Objects in an S3 bucket can be referred to by their URL The key value identifies the object in the bucket Prefixes:\nUse prefixes to imply a folder structure in an S3 bucket Specify prefix: 2021/doc-example-bucket/math Returns the following keys: 2021/doc-example-bucket/math/john.txt 2021/doc-example-bucket/math/maris.txt Object metadata:\nSystem-defined: object creation date object size object version User-defined: information that you assign to the object x-amz-meta key followed by a custom name. Example: x-amz-meta-alt-name Versioning:\nKeep multiple variants of an object in the same bucket In versioning-enabled S3 buckets, each object has a version ID After versioning is enabled, it can only be suspended. Three operations:\nPUT: Upload entire object to a bucket. Max size: 5 GB Should use multipart upload for objects over 100 MB import boto3 S3API = boto3.client(\u0026#34;s3\u0026#34;, region_name=\u0026#34;us-east-1\u0026#34;) bucket_name = \u0026#34;samplebucket\u0026#34; filename = \u0026#34;/resources/website/core.css\u0026#34; S3API.upload_file( filename, bucket_name, \u0026#34;core.css\u0026#34;, ExtraArgs={\u0026#39;ContentType\u0026#39;: \u0026#34;text/css\u0026#34;, \u0026#34;CacheControl\u0026#34;: \u0026#34;max-age=0\u0026#34;}) GET: Used to retrieve objects from Amazon S3 Can retrieve the complete object at once or a range of bytes DELETE: Versioning disabled - object is permanently deleted from the bucket Versioning enabled - delete with key and version ID S3 SELECT:\nA powerful tool to query data in place without the need to fetch the data from buckets. Data encryption: S3 has two types of policies for bucket access:\nACLs: access control lists. Resource-based access policy to manage access at the object level or bucket level. A boto3 sketch of object retrieval follows these notes.
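To round out the PUT example above, a boto3 sketch of retrieval (same hypothetical bucket; byte-range GET and presigned URLs are standard S3 client features):

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket = "samplebucket"

# GET: the whole object, or just a range of bytes
s3.download_file(bucket, "core.css", "/tmp/core.css")
partial = s3.get_object(Bucket=bucket, Key="core.css", Range="bytes=0-1023")
print(partial["Body"].read())

# A presigned URL grants temporary access without sharing credentials
url = s3.generate_presigned_url(
    "get_object", Params={"Bucket": bucket, "Key": "core.css"}, ExpiresIn=3600
)
print(url)
```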
Data storage # As the first step, catalog all of the different data sources in the organization into a master list. Once the master list is created, develop a strategy around how to process the data in a data engineering pipeline\nDetermine the correct storage medium\nDatabase Key/value database, e.g. DynamoDB, ideal for user records or game stats Graph database, e.g. Neptune, for relationship building SQL, e.g. Amazon Aurora, RDS, for transaction-based queries Data lake: Built on top of S3 Metadata + Storage + Compute Can index things that are inside S3 EFS Elastic File System Amazon EFS is a cloud-based file storage service for apps and workloads that run in the AWS public cloud Automatically grows and shrinks as you add and remove files The system manages the storage size automatically without any provisioning EBS Stands for Elastic Block Storage. This is a high-performance block-storage service designed for AWS Elastic Compute Cloud (AWS EC2) Offers a very fast file system, ideal for machine learning training that requires fast file I/O MLOps Template Github # Great templates for MLops projects with GPU:\nhttps://github.com/nogibjj/mlops-template\n","date":"26 February 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week12-data-repositories/","section":"Posts","summary":"\u003cp\u003eWeek 1-2 of the AWS MLops: Data repositories and AWS managed storage choices\u003c/p\u003e","title":"Study notes: MLops Week 1-2 Data Repositories","type":"posts"},{"content":"Week 1-1 of the AWS MLops: Introduction\nAWS SageMaker Studio Lab # Free environment for prototyping machine learning projects Based on JupyterLab Supports two compute types: CPU and GPU Good integration with Hugging Face Offers a terminal window, allowing access to AWS resources through the aws command AWS CloudShell # No credentials to manage because of the role-based privileges\nConvenient file upload/download GUI\nEasy access to AWS resources such as S3 via the aws command\nFile transfer between CloudShell and S3 buckets File synchronization between CloudShell and S3 buckets Cloud Developer Workspace\nVarious vendors: GitHub Codespaces - Easy integration with GitHub services AWS Cloud9, AWS CloudShell is a lightweight version of Cloud9 GCP Cloud IDE Azure Cloud IDE Advantages compared with a traditional laptop/workstation Powerful Disposable Preloaded Notebook-based: GPU + Jupyter Notebook AWS SageMaker Studio Lab Google Colab Notebooks AWS has pre-built machine learning applications that can be accessed directly in CloudShell\nAdvanced text analytics Automated code reviews Chatbots Demand forecasting Document analysis Search Fraud prevention Image and video analysis \u0026hellip; ","date":"16 February 2022","externalUrl":null,"permalink":"/posts/2022/mlops-week11-aws-ml-technologies/","section":"Posts","summary":"\u003cp\u003eWeek 1-1 of the AWS MLops: Introduction\u003c/p\u003e","title":"Study notes: MLops Week 1-1 AWS Machine Learning Technologies","type":"posts"},{"content":"","date":"25 December 2021","externalUrl":null,"permalink":"/categories/math/","section":"Categories","summary":"","title":"Math","type":"categories"},{"content":" Normal distributions are a class of continuous probability distributions for random variables. They often show up in measurements of physical quantities. For example, the human height is normally distributed. In fact, it is claimed that variables in natural and social sciences are normally or approximately normally distributed. Weight, reading ability, test scores, blood pressure, \\(\\ldots\\)\nThe underlying reason for this is partly due to the central limit theorem. The theorem says that the average of many observations of a random variable with finite average and variance is also a random variable whose distribution converges to a normal distribution as the number of observations increases. This conclusion has a very important implication for Monte Carlo methods which I will discuss in another post.\nIn the simplest case (zero mean and unit variance), the distribution is written as \\begin{equation} f(x) = \\frac{1}{\\sqrt{2\\pi}}e^{-\\frac{x^2}{2}}. 
\\nonumber \\end{equation} If we plot the function \\(f(x)\\), the curve looks like the following\nThe prefactor \\(1/\\sqrt{2\\pi}\\) of \\(f(x)\\) is its normalization constant, which guarantees that the integration of \\(f(x)\\) over the entire real axis is unity \\begin{equation} \\int_{-\\infty}^\\infty f(x)dx = 1. \\nonumber \\end{equation} We all know that the mathematical constant \\(\\pi\\) is the ratio of a circle\u0026rsquo;s circumference to its diameter. This brings up an interesting question: why does \\(\\pi\\) show up in probability distributions that seemingly have nothing to do with circles?\nThe reason, it turns out, boils down to how the normalization of a normal distribution is calculated. A normal distribution belongs to a type of function called the Gaussian function \\(g(x) = e^{-x^2}\\). Its integral is \\begin{equation} \\int_{-\\infty}^\\infty e^{-x^2}dx = \\sqrt{\\pi}.\\nonumber \\end{equation}\nStandard approach # A textbook approach for getting the normalization constant is to compute the square of the Gaussian integral \\begin{equation} \\left(\\int_{-\\infty}^\\infty e^{-x^2}dx\\right)^2 = \\int_{-\\infty}^\\infty\\int_{-\\infty}^\\infty e^{-(x^2+y^2)}dxdy. \\nonumber \\end{equation} In this way, we can leverage the coordinate transformation to map the Cartesian coordinates \\((x,y)\\) into polar coordinates \\((r,\\theta)\\) using the identity \\begin{equation} x^2+y^2 = r^2, \\nonumber \\end{equation} which is exactly the equation of a circle with radius \\(r\\). It is this equation (transformation) that allows \\(\\pi\\) to have a place in the normal distribution. To continue, the square of the Gaussian integral now becomes\n\\begin{align} \\int_{-\\infty}^\\infty\\int_{-\\infty}^\\infty e^{-(x^2+y^2)}dxdy \u0026amp;= \\int_0^{2\\pi}\\int_0^\\infty e^{-r^2} \\,r drd\\theta \\nonumber \\\\\\ \u0026amp;= 2\\pi\\int_0^\\infty \\frac{1}{2} e^{-y} dy,\\,\\,\\,\\,\\,\\,\\text{where}\\,\\,\\, y=r^2 \\nonumber\\\\\\ \u0026amp;= \\pi. \\nonumber \\end{align}\nAs a result, the normalization of the Gaussian function is \\(\\sqrt{\\pi}\\). (A quick numerical check of this appears after the post.)\nContour integration # Of course, the method described above is not the only way to normalize the Gaussian integral. There are many proofs ranging from differentiation of an integral, volume integration, \\(\\Gamma\\)-function, asymptotic approximation, Stirling\u0026rsquo;s formula, Laplace\u0026rsquo;s original proof, and the residue theorem.\nSpeaking of the residue theorem, it is a powerful tool for evaluating integrals. The theorem generalizes Cauchy\u0026rsquo;s theorem and says that the integral of an analytic function \\(f(z)\\) around a closed contour \\(C\\) depends only on the properties of a few special points (singularities, or poles) inside the contour.\n\\begin{equation} \\int_C f(z) dz = 2\\pi i\\sum_{j=1}^n R_j, \\nonumber \\end{equation}\nwhere \\(R_j\\), one of those special points, is called the residue at the point \\(z_j\\):\n\\begin{equation} R_j = \\frac{1}{2\\pi i}\\oint_{C_j} f(z) dz. \\nonumber \\end{equation}\nSo the residue theorem by definition has a prefactor that contains \\(\\pi\\). However, if we want to apply the theorem to the Gaussian function, we would hit a wall. This is because the complex Gaussian function \\(g(z) = e^{-z^2}\\) has no singularities on the entire complex plane. This property has bothered many, and in 1914 the mathematician G. N. Watson said in his textbook Complex Integration and Cauchy\u0026rsquo;s Theorem that\nCauchy’s theorem cannot be employed to evaluate all definite integrals; thus \\(\\displaystyle\\int_0^\\infty e^{-x^2}dx\\) has not been evaluated except by other methods.\nFinally, in the 1940s, several proofs were published using the residue theorem. However, these proofs are based on awkward contours and analytic functions that seem to come out of nowhere. For example, Kneser used the following definition 1 $$ \\frac{e^{-z^2/2}}{1-e^{-\\sqrt{\\pi}(1+i)z}}. $$ Basically the idea is to construct an analytic function that looks like the Gaussian but with poles so that the residue theorem can be applied.\nWhile I like the beauty and power of the residue theorem, the complex analysis proof just seems too complex and artificial for my taste. The standard textbook proof is simple and incorporates naturally the reason why $\\pi$ is there in the normalization. Anyway, I found this little fact a bit interesting.\nH. Kneser, Funktionentheorie, Vandenhoeck and Ruprecht, 1958.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"25 December 2021","externalUrl":null,"permalink":"/posts/2021/gaussial-integral/","section":"Posts","summary":"\u003cp\u003e\n\nNormal distributions are a class of continuous probability distributions for random variables.\nThey often show up in measurements of physical quantities. For example,\n\u003ca href=\"https://ourworldindata.org/human-height#height-is-normally-distributed\" target=\"_blank\"\u003e\nthe human height is normally distributed\u003c/a\u003e. In fact, it is claimed that variables in natural\nand social sciences are normally or approximately normally distributed. Weight, reading\nability, test scores, blood pressure, \\(\\ldots\\)\u003c/p\u003e","title":"Normal distributions and \\\\(\\pi\\\\)","type":"posts"},
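A quick numerical sanity check of the two integrals in the post above (my addition, not part of the original argument), using SciPy quadrature:

```python
import numpy as np
from scipy import integrate

# The Gaussian integral over the real axis equals sqrt(pi)
value, _ = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print(value, np.sqrt(np.pi))   # both ~1.7724538509055159

# The normalized density f(x) integrates to 1
value, _ = integrate.quad(
    lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi), -np.inf, np.inf
)
print(value)                   # ~1.0
```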
{"content":"Revisited the Orion Nebula M42. This time with the mighty (compared to the WO Z61 MK2) Takahashi TSA-120N. M42 is a very bright nebula so it is fairly straightforward to image overall. The most tricky part is probably the Trapezium region where care must be taken in order to prevent overexposure. I managed to get a few hours of LRGB data before the clouds kicked in. Applied a PixInsight HDR trick I learned from the internet to process the Trapezium cluster region.\nIntegrated image # Telescope: Takahashi TSA-120N Camera: ZWO ASI294MM Pro Mount: iOptron CEM40 Filters: ZWO LRGB 31mm Post Processing: Astro Pixel Processor + PixInsight Total Integration time: 2h 30' Check out the photo on my Astrobin\n","date":"5 November 2021","externalUrl":null,"permalink":"/posts/2021/11-05-orion/","section":"Posts","summary":"\u003cp\u003eRevisited the Orion Nebula M42. This time with the mighty (compared to the WO Z61 MK2) Takahashi TSA-120N.\nM42 is a very bright nebula so it is fairly straightforward to image overall. The most tricky part is probably\nthe Trapezium region where care must be taken in order to prevent overexposure. I managed to get a few hours\nof LRGB data before the clouds kicked in. 
Applied a PixInsight HDR trick I learned from the internet to\nprocess the Trapezium cluster region.\u003c/p\u003e","title":"Orion Nebula","type":"posts"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"}] \ No newline at end of file diff --git a/posts/2021/11-05-orion/index.html b/posts/2021/11-05-orion/index.html index c250fd5..50b773e 100644 --- a/posts/2021/11-05-orion/index.html +++ b/posts/2021/11-05-orion/index.html @@ -4,7 +4,7 @@ -

Orion Nebula

·118 words·1 min
Author
fermion

Revisited the Orion Nebula M42. This time with the mighty (compared to the WO Z61 MK2) Takahashi TSA-120N. +

Orion Nebula

·118 words·1 min
Author
fermion
Table of Contents

Revisited the Orion Nebula M42. This time with the mighty (compared to the WO Z61 MK2) Takahashi TSA-120N. M42 is a very bright nebula so it is fairly straightforward to image overall. The most tricky part is probably the Trapezium region where care must be taken in order to prevent overexposure. I managed to get a few hours of LRGB data before the clouds kicked in. Applied a PixInsight HDR trick I learned from the internet to diff --git a/posts/2021/gaussial-integral/index.html b/posts/2021/gaussial-integral/index.html index 52a8a9c..6b13360 100644 --- a/posts/2021/gaussial-integral/index.html +++ b/posts/2021/gaussial-integral/index.html @@ -4,7 +4,7 @@ -

Normal distributions and \\(\pi\\)

·734 words·4 mins
Author
fermion

Normal distributions are a class of continuous probability distributions for random variables. +

Normal distributions and \\(\pi\\)

·734 words·4 mins
Author
fermion
Table of Contents

Normal distributions are a class of continuous probability distributions for random variables. They often show up in measurements of physical quantities. For example, human height is normally distributed. In fact, it is claimed that variables in natural and social sciences are normally or approximately normally distributed. Weight, reading ability, test scores, blood pressure, \(\ldots\)
diff --git a/posts/2022/02-28-horsehead/index.html index fe755d6..7b8db7e 100644 --- a/posts/2022/02-28-horsehead/index.html +++ b/posts/2022/02-28-horsehead/index.html @@ -4,7 +4,7 @@ -

Horsehead Nebula

·119 words·1 min
Author
fermion

IC434 Horsehead Nebula. This nebula is probably one of the most imaged subjects in astrophotography. +

Horsehead Nebula

·119 words·1 min
Author
fermion
Table of Contents

IC434 Horsehead Nebula. This nebula is probably one of the most imaged subjects in astrophotography. IC434 is a dark nebula in the constellation Orion. It is about 1375 light years away from Earth. The nebula got its name because the dark dust cloud looks just like a horse’s head. Behind the Horsehead nebula is a huge area of ionized hydrogen gas lit up by the star group Sigma Orionis
diff --git a/posts/2022/03-04-monkey/index.html index 4d4f6ef..eca6325 100644 --- a/posts/2022/03-04-monkey/index.html +++ b/posts/2022/03-04-monkey/index.html @@ -4,7 +4,7 @@ -

NGC2175, Monkey Head Nebula

·99 words·1 min
Author
fermion

SH2-252 Monkey Head Nebula. Located in the Orion Constellation. Embedded in the nebula is an open +

NGC2175, Monkey Head Nebula

·99 words·1 min
Author
fermion
Table of Contents

SH2-252 Monkey Head Nebula. Located in the Orion Constellation. Embedded in the nebula is an open cluster NGC 2175. The cluster is about 6350 light years from Earth. The nebula has active ionized oxygen, sulfur, and hydrogen clouds. A very good narrowband subject. Does the nebula look like a monkey head? I can’t really tell…😂

Integrated image
#

-

IC1396 Elephant's Trunk Nebula

·143 words·1 min
Author
fermion

Elephant’s Trunk and the IC1396. Standard SHO narrow band image. Elephant’s trunk (the dark cloud on the left) is an area of interstellar gas and dust in the large ionized gas region IC1396. The area is located in the constellation Cepheus, about 2400 light years away from Earth.

The entire area is very dynamic. The IC1396 is being illuminated and ionized by a bright and massive star HD 206267 seen near the center of the frame. The Elephant’s Trunk region is currently thought to be a star formation site containing many young stars (less than 100,000 years old).

Integrated image
#

IC1396 Elephant's Trunk Nebula

·143 words·1 min
Author
fermion
Table of Contents

Elephant’s Trunk and the IC1396. Standard SHO narrow band image. Elephant’s trunk (the dark cloud on the left) is an area of interstellar gas and dust in the large ionized gas region IC1396. The area is located in the constellation Cepheus, about 2400 light years away from Earth.

The entire area is very dynamic. The IC1396 is being illuminated and ionized by a bright and massive star HD 206267 seen near the center of the frame. The Elephant’s Trunk region is currently thought to be a star formation site containing many young stars (less than 100,000 years old).

Integrated image
#

  • Telescope: Takahashi FSQ-85EDX
  • Camera: ZWO ASI294MM Pro
  • Mount: iOptron CEM40
  • Filters: Astronomik H-alpha 6nm 1.25" + SII 6nm 1.25" + OIII 6nm 1.25"
  • Post Processing: Astro Pixel Processor + PixInsight
  • Total Integration time: 15h 35'

Check out the photo on my Astrobin

M13 Cluster

·88 words·1 min
Author
fermion

M13, the Great Globular Cluster. Located in the constellation of Hercules, it is roughly 27,100 light +

M13 Cluster

·88 words·1 min
Author
fermion
Table of Contents

M13, the Great Globular Cluster. Located in the constellation of Hercules, it is roughly 27,100 light years away from Earth. The cluster is the brightest globular cluster in the northern hemisphere, visible to the naked eye under dark skies. My first astrophotography image taken with the Takahashi FSQ-85EDX, also called the Baby-Q.

Integrated image
#

-

Squid Nebula

·325 words·2 mins
Author
fermion

SH2-129 Flying Bat (red) and OU-4 Squid Nebulae. The combo can be found in the constellation Cepheus. +

Squid Nebula

·325 words·2 mins
Author
fermion
Table of Contents

SH2-129 Flying Bat (red) and OU-4 Squid Nebulae. The combo can be found in the constellation Cepheus. Both belong to the family of emission nebulae. The Flying Bat consists mostly of ionized hydrogen gas, while the Squid is a region full of ionized oxygen clouds. Due to its extreme dimness, the Squid was discovered only recently, in 2011, by the French astrophotographer Nicolas Outters. The formation of
diff --git a/posts/2022/11-04-veil/index.html index d3ae12e..aa6489b 100644 --- a/posts/2022/11-04-veil/index.html +++ b/posts/2022/11-04-veil/index.html @@ -4,7 +4,7 @@ -

Veil Nebula

·95 words·1 min
Author
fermion

The Veil Nebula. This is a 2x3 Mosaic narrow-band imaging project that took me almost two months (9/4/2022 ~ 11/1/2022) +

Veil Nebula

·95 words·1 min
Author
fermion
Table of Contents

The Veil Nebula. This is a 2x3 Mosaic narrow-band imaging project that took me almost two months (9/4/2022 ~ 11/1/2022) to complete. A total of 2155 subs with 179 hours of accumulated exposure time. This nebula is a magnificent supernova remnant in the area of Cygnus, one of my favorite constellations.

Integrated image
#

-

Soul Nebula

·95 words·1 min
Author
fermion

Westerhout 5, known to most as the Soul Nebula, is an emission nebula located in Cassiopeia. +

Soul Nebula

·95 words·1 min
Author
fermion
Table of Contents

Westerhout 5, known to most as the Soul Nebula, is an emission nebula located in Cassiopeia. This is a large star-forming region like the Orion Nebula. The dark hollow areas embedded in the blue regions are cavities that were carved out by radiation and winds from the region’s massive stars.

Integrated image
#

-

Study notes: MLops Week 1-1 AWS Machine Learning Technologies

·173 words·1 min
Author
fermion

Week 1-1 of the AWS MLops: Introduction

AWS Sagemaker Studio Lab
#

  • Free environment for prototyping machine learning projects
  • Based on JupyterLab
  • Supports two compute types: CPU and GPU
  • Good integration with Hugging Face
  • Offers a terminal window, allowing access to AWS resources through the aws command

AWS CloudShell
#

  • No credentials to manage because of the role-based privileges

  • Convenient file upload/download GUI

  • Easy access to AWS resources such as S3 via the aws command

    • File transfer between CloudShell and S3 buckets
    • File synchronization between CloudShell and S3 buckets
  • Cloud Developer Workspace

    • Various vendors:
      • GitHub Codespaces - easy integration with GitHub services
      • AWS Cloud9 (AWS CloudShell is a lightweight version of Cloud9)
      • GCP Cloud IDE
      • Azure Cloud IDE
    • Advantages compared with a traditional laptop/workstation
      • Powerful
      • Disposable
      • Preloaded
    • Notebook-based: GPU + Jupyter Notebook
      • AWS Sagemaker Studio Lab
      • Google Colab Notebooks
  • AWS has pre-built machine learning applications that can be accessed directly in CloudShell

    • Advanced text analytics
    • Automated code reviews
    • Chatbots
    • Demand forecasting
    • Document analysis
    • Search
    • Fraud prevention
    • Image and video analysis

Study notes: MLops Week 1-1 AWS Machine Learning Technologies

·173 words·1 min
Author
fermion
Table of Contents

Week 1-1 of the AWS MLops: Introduction

AWS Sagemaker Studio Lab
#

  • Free environment for prototyping machine learning projects
  • Based on JupyterLab
  • Supports two compute types: CPU and GPU
  • Good integration with Hugging Face
  • Offers a terminal window, allowing access to AWS resources through the aws command

AWS CloudShell
#

  • No credentials to manage because of the role-based privileges

  • Convenient file upload/download GUI

  • Easy access to AWS resources such as S3 via the aws command

    • File transfer between CloudShell and S3 buckets
    • File synchronization between CloudShell and S3 buckets
  • Cloud Developer Workspace

    • Various vendors:
      • GitHub Codespaces - easy integration with GitHub services
      • AWS Cloud9 (AWS CloudShell is a lightweight version of Cloud9)
      • GCP Cloud IDE
      • Azure Cloud IDE
    • Advantages compared with a traditional laptop/workstation
      • Powerful
      • Disposable
      • Preloaded
    • Notebook-based: GPU + Jupyter Notebook
      • AWS Sagemaker Studio Lab
      • Google Colab Notebooks
  • AWS has pre-built machine learning applications that can be accessed directly in CloudShell (a sketch follows this list)

    • Advanced text analytics
    • Automated code reviews
    • Chatbots
    • Demand forecasting
    • Document analysis
    • Search
    • Fraud prevention
    • Image and video analysis
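These managed applications are plain API calls once credentials are in place. Here is a minimal sketch, assuming the CloudShell role has Comprehend permissions; the sample sentence is made up.

import boto3

# CloudShell supplies role-based credentials, so no keys are configured here.
comprehend = boto3.client("comprehend", region_name="us-east-1")

# Advanced text analytics: detect the sentiment of a sample sentence.
response = comprehend.detect_sentiment(
    Text="The demand forecast for next quarter looks strong.",
    LanguageCode="en",
)
print(response["Sentiment"], response["SentimentScore"])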

Study notes: MLops Week 1-2 Data Repositories

·475 words·3 mins
Author
fermion

Week 1-2 of the AWS MLops: Data repositories and AWS managed storage choices

Introduction to AWS S3
#

  • Object storage service that offers scalability, data availability, security, and performance.

    • 99.999999999% (11 nines) of durability
    • Easy to use management features
    • Can respond to event triggers
  • Use cases:

    • Content storage/distribution
    • Backup, restore, and archive
    • Data lakes and big data analytics
    • Disaster recovery
    • Static website hosting
  • Component:

    • Bucket: https://s3-<aws-region>.amazonaws.com/<bucket-name>/
    • Object: https://s3-<aws-region>.amazonaws.com/<bucket-name>/<object-key>
    • Objects in an S3 bucket can be referred to by their URL
    • The key value identifies the object in the bucket
  • Prefixes:

    • Use prefixes to imply a folder structure in an S3 bucket
      • Specify prefix: 2021/doc-example-bucket/math
      • Returns the following keys:
        • 2021/doc-example-bucket/math/john.txt
        • 2021/doc-example-bucket/math/maris.txt
  • Object metadata:

    • System-defined:
      • object creation date
      • object size
      • object version
    • User-defined:
      • information that you assign to the object
      • x-amz-meta key followed by a custom name. Example: x-amz-meta-alt-name
  • Versioning:

    • Keep multiple variants of an object in the same bucket
    • In versioning-enabled S3 buckets, each object has a version ID
    • After versioning is enabled, it cannot be disabled, only suspended.
  • Three operations:

    • PUT:
      • Upload entire object to a bucket. Max size: 5 GB
      • Should use multipart upload for objects over 100 MB
      import boto3
      +

Study notes: MLops Week 1-2 Data Repositories

·475 words·3 mins
Author
fermion
Table of Contents

Week 1-2 of the AWS MLops: Data repositories and AWS managed storage choices

Introduction to AWS S3
#

  • Object storage service that offers scalability, data availability, security, and performance.

    • 99.999999999% (11 nines) of durability
    • Easy to use management features
    • Can respond to event triggers
  • Use cases:

    • Content storage/distribution
    • Backup, restore, and archive
    • Data lakes and big data analytics
    • Disaster recovery
    • Static website hosting
  • Component:

    • Bucket: https://s3-<aws-region>.amazonaws.com/<bucket-name>/
    • Object: https://s3-<aws-region>.amazonaws.com/<bucket-name>/<object-key>
    • Objects in an S3 bucket can be referred to by their URL
    • The key value identifies the object in the bucket
  • Prefixes:

    • Use prefixes to imply a folder structure in an S3 bucket (see the listing sketch below)
      • Specify prefix: 2021/doc-example-bucket/math
      • Returns the following keys:
        • 2021/doc-example-bucket/math/john.txt
        • 2021/doc-example-bucket/math/maris.txt
  • Object metadata:

    • System-defined:
      • object creation date
      • object size
      • object version
    • User-defined:
      • information that you assign to the object
      • x-amz-meta key followed by a custom name. Example: x-amz-meta-alt-name
  • Versioning:

    • Keep multiple variants of an object in the same bucket
    • In versioning-enabled S3 buckets, each object has a version ID
    • After versioning is enabled, it cannot be disabled, only suspended.
  • Three operations:

    • PUT:
      • Upload entire object to a bucket. Max size: 5 GB
      • Should use multipart upload for objects over 100 MB
      import boto3

      # Create an S3 client; the region should match where the bucket lives.
      S3API = boto3.client("s3", region_name="us-east-1")
      bucket_name = "samplebucket"
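To make the snippet above concrete, here is a hedged completion covering PUT, multipart upload, prefixes, and versioning. The bucket, keys, and local file name are placeholders; upload_file is used for the large object because boto3 switches to multipart transfers automatically above an internal size threshold.

import boto3

S3API = boto3.client("s3", region_name="us-east-1")
bucket_name = "samplebucket"  # placeholder; bucket names are globally unique

# PUT: upload an entire small object in a single request (max 5 GB per PUT).
S3API.put_object(
    Bucket=bucket_name,
    Key="2021/doc-example-bucket/math/john.txt",
    Body=b"hello from boto3",
)

# Objects over ~100 MB should use multipart upload; upload_file handles
# the multipart mechanics automatically for large files.
S3API.upload_file("big-dataset.csv", bucket_name, "data/big-dataset.csv")

# Prefixes: list only the keys under the math/ "folder".
resp = S3API.list_objects_v2(Bucket=bucket_name, Prefix="2021/doc-example-bucket/math/")
for obj in resp.get("Contents", []):
    print(obj["Key"])

# Versioning: once enabled, it can only be suspended, not disabled.
S3API.put_bucket_versioning(
    Bucket=bucket_name,
    VersioningConfiguration={"Status": "Enabled"},
)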
      diff --git a/posts/2022/mlops-week13-data-ingestion/index.html b/posts/2022/mlops-week13-data-ingestion/index.html
      index 0ce9bf9..a47d7a4 100644
      --- a/posts/2022/mlops-week13-data-ingestion/index.html
      +++ b/posts/2022/mlops-week13-data-ingestion/index.html
      @@ -4,7 +4,7 @@
       
      -

Study notes: MLops Week 1-3 Data Ingestion and Transformation

·329 words·2 mins
Author
fermion

Week 1-3 of the AWS MLops: Data ingestion and AWS jobs

AWS job styles:
#

  • Batch
    • Glue: creates metadata that lets you perform operations on, e.g., S3 or a database. This is a serverless +

Study notes: MLops Week 1-3 Data Ingestion and Transformation

·329 words·2 mins
Author
fermion
Table of Contents

Week 1-3 of the AWS MLops: Data ingestion and AWS jobs

AWS job styles:
#

  • Batch
    • Glue: creates metadata that lets you perform operations on, e.g., S3 or a database. This is a serverless ETL system.
    • Batch: general-purpose batch processing; can run anything at scale in containers, including training models with GPUs.
    • Step functions: parameterize different steps, orchestrate Lambda functions with inputs.
  • Streaming
    • Kinesis: send in small payloads and process them as they arrive (see the sketch after the comparison below).
    • Kafka via Amazon MSK

In terms of operational complexity and data size, here is a high-level comparison:

             Batch     Streaming
complexity   simple    complex
data size    large     small
  • Complexity: Batch jobs are simpler; they receive data, run operations across it, then return results. Streaming jobs, on the other hand, must take in data as it arrives and are a bit more prone to errors.
  • Data size: Batch jobs are good at handling large data payloads since they are designed to process in bulk, while streaming jobs process records as they come in.
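A hedged sketch of the streaming side using boto3 and Kinesis, as mentioned in the list above. The stream name and payload are invented, and the stream is assumed to already exist.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Streaming ingestion: push one small payload; consumers pick up records
# as they arrive instead of waiting for a large batch.
kinesis.put_record(
    StreamName="sensor-stream",  # hypothetical, pre-created stream
    Data=json.dumps({"sensor_id": 7, "temp_c": 21.4}).encode("utf-8"),
    PartitionKey="sensor-7",
)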

Batch: data ingestion and processing pipelines
#

-

Study notes: MLops Week 2 AWS ML Data Preparation

·414 words·2 mins
Author
fermion

Week 2 of the AWS MLops: Data preparation using AWS services

Machine Learning (ML) and AWS ML Services
#

  • Deep learning $\subset$ Machine learning $\subset$ Artificial intelligence

  • ML is the scientific study of algorithms and statistical models to perform a task using inference instead of instructions

    Typical workflow: Data → Model → Prediction

  • Types of ML algorithms and common business use cases

    • Supervised learning:
      • Fraud detection
      • Image recognition - computer vision
      • Customer retention
      • Medical diagnostics - computer vision
      • Personalized advertising
      • Sales prediction
      • Weather forecasting
      • Market projection
      • Population growth prediction
      • $\ldots$
    • Unsupervised learning:
      • Product recommendations
      • Customer segmentation
      • Targeted marketing
      • Medical diagnostics
      • Natural language processing - chatbot, translation, sentiment analysis
      • Data structure discovery
      • Gene sequencing
      • $\ldots$
    • Reinforcement learning: best when the desired outcome is known but the exact path to achieving it is not known
      • Game AI
      • Self-driving cars
      • Robotics
      • Customer service routing
      • $\ldots$
  • Use ML when you have:

    • Large datasets, large number of variables
    • Lack of clear procedures to obtain solutions
    • Existing ML expertise
    • Infrastructure already in place to support ML
    • Management support for ML
  • Typical ML workflow

Study notes: MLops Week 2 AWS ML Data Preparation

·414 words·2 mins
Author
fermion
Table of Contents

Week 2 of the AWS MLops: Data preparation using AWS services

Machine Learning (ML) and AWS ML Services
#

  • Deep learning $\subset$ Machine learning $\subset$ Artificial intelligence

  • ML is the scientific study of algorithms and statistical models to perform a task using inference instead of instructions

    Typical workflow: Data → Model → Prediction

  • Types of ML algorithms and common business use cases

    • Supervised learning:
      • Fraud detection
      • Image recognition - computer vision
      • Customer retention
      • Medical diagnostics - computer vision
      • Personalized advertising
      • Sales prediction
      • Weather forecasting
      • Market projection
      • Population growth prediction
      • $\ldots$
    • Unsupervised learning:
      • Product recommendations
      • Customer segmentation
      • Targeted marketing
      • Medical diagnostics
      • Natural language processing - chatbot, translation, sentiment analysis
      • Data structure discovery
      • Gene sequencing
      • $\ldots$
    • Reinforcement learning: best when the desired outcome is known but the exact path to achieving it is not known
      • Game AI
      • Self-driving cars
      • Robotics
      • Customer service routing
      • $\ldots$
  • Use ML when you have:

    • Large datasets, large number of variables
    • Lack of clear procedures to obtain solutions
    • Existing ML expertise
    • Infrastructure already in place to support ML
    • Management support for ML
  • Typical ML workflow

    (figure: typical ML workflow)

    • Iterative process
      • Data processing
      • Training
      • Evaluation
  • ML frameworks and infrastructure

    • Frameworks provide tools and code libraries
      • Customized scripting
      • Integration with AWS services
      • Community of developers
      • Example: PyTorch, TensorFlow, scikit-learn, $\ldots$
    • Infrastructure
      • Designed for ML applications
      • AWS IoT Greengrass provides an infrastructure for building ML for IoT devices
      • AWS Elastic Inference reduces costs for running ML apps
  • AWS ML managed services, no ML experience required

    • Computer vision: Amazon Rekognition, Amazon Textract
    • Speech: Amazon Polly, Amazon Transcribe
    • Language: Amazon Comprehend, Amazon Translate
    • Chatbots: Amazon Lex
    • Forecasting: Amazon Forecast
    • Recommendations: Amazon Personalize
  • Three layers of the Amazon Machine Learning stack:

    • Managed Services
    • Machine Learning Services
    • Machine Learning Frameworks
  • ML challenges

    • Data
      • Poor quality
      • Non-representative
      • Insufficient
      • Overfitting and underfitting
    • Business
      • Complexity in formulating questions
      • Explaining models to business stakeholders
      • Cost of building systems
    • Users
      • Lack of data science expertise
      • Cost of staffing with data scientists
      • Lack of management support
    • Technology
      • Data privacy issue
      • Tool selection can be complicated
      • Integration with other systems

Feature Engineering
#

  • Public dataset for feature engineering and model tuning

    • Hugging Face public datasets
    • Kaggle public datasets
    • Amazon S3 buckets
  • A useful concept: combine old features to produce new features for training/validation

  • Helpful to create an ML project structure so that the project can be managed and tracked phase by phase

    • Data ingest
    • Exploratory data analysis (EDA)
    • Modeling
    • Conclusion
  • At the EDA phase, typical approaches (a sketch follows this list)

    • Look at descriptive statistics
    • Graphing data, examine trends: linear, logarithmic, $\ldots$
    • Clustering data
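A small pandas sketch of the descriptive-statistics and graphing steps above, run on a hypothetical CSV produced by the ingest phase; the file path and the sales column are invented.

import pandas as pd
from pandas.plotting import scatter_matrix

df = pd.read_csv("data/ingested.csv")  # hypothetical ingest output

# Descriptive statistics for every numeric column.
print(df.describe())

# Graph the data to examine trends (assumes a 'sales' column exists).
df["sales"].plot(title="Trend check")

# Pairwise scatter plots across the numeric features.
scatter_matrix(df.select_dtypes("number"), figsize=(8, 8))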

Study notes: MLops Week 3-1 Machine Learning pipeline

·845 words·4 mins
Author
fermion

Week 3-1 of the AWS MLops: ML steps, pipeline, and AWS SageMaker

Machine Learning Pipeline and AWS SageMaker
#

Forming business problem
#

  • Define business objective, questions to ask:

    • How is this task done today?
    • How will the business measure success?
    • How will the solution be used?
    • Do similar solutions exist?
    • What are the assumptions?
    • Who are the domain experts?
  • Frame the business problem

    • Is it a machine learning problem? What kind of ML problem is it? Classification or regression?
    • Is the problem supervised or unsupervised?
    • What is the target to predict?
    • Have access to the relevant data?
    • What is the minimum or baseline performance?
    • Would you solve the problem manually?
    • What is the simplest solution?
    • Can the success be measured?

Collect and secure data, ETL
#

  • Data sources

    • Private data
    • Commercial data: AWS Data Exchange, AWS Marketplace, $\ldots$
    • Open-source data
      • Kaggle
      • World Health Organization
      • US Census Bureau
      • National Oceanic and Atmospheric Administration
      • UC Irvine Machine Learning repository
      • AWS
    • Critical question: Is your data representative?
  • ETL with AWS Glue

    • Runs the ETL process
    • Crawls data sources to create catalogs that can be queried
    • ML functionality
    • AWS Glue can glue together different datasets and emit a single endpoint that can be queried

Study notes: MLops Week 3-1 Machine Learning pipeline

·845 words·4 mins
Author
fermion
Table of Contents

Week 3-1 of the AWS MLops: ML steps, pipeline, and AWS SageMaker

Machine Learning Pipeline and AWS SageMaker
#

Forming business problem
#

  • Define business objective, questions to ask:

    • How is this task done today?
    • How will the business measure success?
    • How will the solution be used?
    • Do similar solutions exist?
    • What are the assumptions?
    • Who are the domain experts?
  • Frame the business problem

    • Is it a machine learning problem? What kind of ML problem is it? Classification or regression?
    • Is the problem supervised or unsupervised?
    • What is the target to predict?
    • Have access to the relevant data?
    • What is the minimum or baseline performance?
    • Would you solve the problem manually?
    • What is the simplest solution?
    • Can the success be measured?

Collect and secure data, ETL
#

  • Data sources

    • Private data
    • Commercial data: AWS Data Exchange, AWS Marketplace, $\ldots$
    • Open-source data
      • Kaggle
      • World Health Organization
      • US Census Bureau
      • National Oceanic and Atmospheric Administration
      • UC Irvine Machine Learning repository
      • AWS
    • Critical question: Is your data representative?
  • ETL with AWS Glue

    • Runs the ETL process
    • Crawls data sources to create catalogs that can be queried
    • ML functionality
    • AWS Glue can glue together different datasets and emit a single endpoint that can be queried

    (figure: AWS Glue ETL)

  • Data security: Access control and Data encryption

    • Control access using AWS Identity and Access Management (IAM) policy
    • AWS S3 default encryption
    • AWS RDS encryption
    • AWS CloudTrail: tracks user activity and application programming interface (API) usage

Data evaluation
#

  • Make sure the data is in the correct format
  • Use descriptive statistics to gain insights into the dataset before cleaning the data
    • Overall statistics
      • Categorical statistics can identify frequency of values and class imbalances
    • Multivariate statistics
      • Scatter plot to inspect the correlation between two variables
      • pandas provides the scatter_matrix method to examine multivariate correlations
      • Correlation matrix and heat map (sketched after this list)
    • Attribute statistics
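A hedged sketch of the correlation-matrix step with a tiny invented dataset; seaborn's heatmap is one common way to render it.

import pandas as pd
import seaborn as sns

# Invented numeric dataset.
df = pd.DataFrame({
    "temp":   [20, 22, 25, 31, 33],
    "demand": [180, 190, 220, 300, 320],
    "price":  [10, 10, 11, 12, 12],
})

# Correlation matrix: pairwise linear relationships between variables.
corr = df.corr()
print(corr)

# Heat map of the same matrix.
sns.heatmap(corr, annot=True, cmap="coolwarm")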

Feature engineering
#

  • Feature extraction

    • Data encoding

      • Categorical data must be converted to a numeric scale
      • If the data is non-ordinal, the encoding must also be non-ordinal, which might require breaking the value into multiple category columns (a sketch follows this list)
    • Data cleaning

      • Variations in strings: text standardization
      • Variations in scale: scale normalization
      • Columns with multiple data items: parse into multiple columns
      • Missing data:
        • Cause of missing data:
          • undefined values
          • data collection errors
          • data cleaning errors
        • Plan for missing data: ask the following questions first
          • What were the mechanisms causing the missing data?
          • Are values missing at random?
          • Are rows or columns missing that you are not aware of?
        • Standard approaches
          • Dropping missing data
          • Imputing missing data
      • Outliers
        • Finding the outliers: box plots or scatter plots for visualization
        • Dealing with outliers:
          • Delete - e.g. outliers were created by artificial errors
          • Transform - reduce the variation
          • Impute - e.g. use mean for the outliers
  • Feature selection

    • Filter method

      • Pearson’s correlation
      • Linear discriminant analysis (LDA)
      • Analysis of variance (ANOVA)
      • Chi-square $\chi^2$ analysis
    • Wrapper method

      • Forward selection
      • Backward selection
      -
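A hedged sketch tying together several extraction steps named above: one-hot encoding for a non-ordinal category, mean imputation for a numeric gap, and IQR-based outlier flagging. All column names are invented.

import pandas as pd

df = pd.DataFrame({
    "color":     ["red", "blue", None, "red"],  # non-ordinal categorical
    "height_cm": [170.0, None, 165.0, 250.0],   # has a gap and an outlier
})

# Non-ordinal encoding: break the category into multiple 0/1 columns.
df = pd.get_dummies(df, columns=["color"], dummy_na=True)

# Imputing missing data: fill the numeric gap with the column mean.
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())

# Outliers: flag values outside 1.5x the interquartile range.
q1, q3 = df["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["height_cm"] < q1 - 1.5 * iqr) | (df["height_cm"] > q3 + 1.5 * iqr)])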

Study notes: MLops Week 3-2 Forecasting

·325 words·2 mins
Author
fermion

Week 3-2 of the AWS MLops: AWS Forecast and time-series data

Forecasting and AWS Forecast
#

Overview
#

  • Predicting future values that are based on historical data
  • Patterns include
    • Trends
    • Seasonal, pattern that is based on seasons
    • Cyclical, other repeating patterns
    • Irregular, patterns that might appear to be random
  • Examples
    • Sales and demand forecast
    • Energy consumption
    • Inventory projections
    • Weather forecast

Processing time series data
#

  • Time series data is captured in sequence over time

  • Handle missing data

    • Forward fill
    • Backward fill
    • Moving average
    • Interpolation: linear, spline, or polynomial
    • Sometimes zero is a good fill value
  • Resampling: resampling time series data allows the flexibility of defining the resolution of the data

    • Upsampling: increase the sample frequency, e.g. from minutes to seconds. Care must be taken +

Study notes: MLops Week 3-2 Forecasting

·325 words·2 mins
Author
fermion
Table of Contents

Week 3-2 of the AWS MLops: AWS Forecast and time-series data

Forecasting and AWS Forecast
#

Overview
#

  • Predicting future values that are based on historical data
  • Patterns include
    • Trends
    • Seasonal, pattern that is based on seasons
    • Cyclical, other repeating patterns
    • Irregular, patterns that might appear to be random
  • Examples
    • Sales and demand forecast
    • Energy consumption
    • Inventory projections
    • Weather forecast

Processing time series data
#

  • Time series data is captured in sequence over time

  • Handle missing data

    • Forward fill
    • Backward fill
    • Moving average
    • Interpolation: linear, spline, or polynomial
    • Sometimes zero is a good fill value
  • Resampling: resampling time series data allows the flexibility of defining the resolution of the data

    • Upsampling: increase the sample frequency, e.g. from minutes to seconds. Care must be taken in deciding how the fine-grained samples are computed.
    • Downsampling: decrease the sample frequency, e.g. from days to months. Need to pay attention to how the aggregation is carried out.
    • Reasons for resampling:
      • Inspect the behavior of data under different resolutions
      • Join tables with different resolutions
  • Sample smoothing, including outlier removal

    • Why
      • Part of the data preparation process
      • For visualization
    • How does smoothing affect the outcome
      • Cleaner data to model
      • Model compatibility
      • Production improvement?
  • Seasonality

    • Hourly, daily, quarterly, yearly
    • Spring, summer, fall, winter
    • Holidays
  • Time series sample correlations

    • Stationarity
      • How stable is the system
      • Does the past inform the future
    • Trends
      • Correlation issues
    • Autocorrelation
      • How points in time series sample are linearly related
  • pandas offers many methods for handling time series data (a sketch follows this list)

    • Time-aware index
    • groupby and resample()
    • autocorr() method
  • Times series algorithms offered by Amazon Forecast

    • ARIMA, autoregressive integrated moving average
    • DeepAR+
    • Exponential Smoothing (ETS)
    • Non-Parametric Time Series (NPTS)
    • Prophet
  • Model evaluation

    • Time series data model training cannot use $k$-fold cross validation because the data is ordered and correlated.
    • Standard approach: back testing
    -
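A hedged pandas sketch of the ideas above on a synthetic daily series: forward fill, monthly downsampling, lag-1 autocorrelation, and an ordered train/test split in the spirit of back testing.

import pandas as pd

# Synthetic daily series with one missing value.
idx = pd.date_range("2024-01-01", periods=120, freq="D")
ts = pd.Series(range(120), index=idx, dtype="float64")
ts.iloc[10] = None

# Missing data: forward fill (backward fill would be ts.bfill()).
ts = ts.ffill()

# Downsampling: aggregate days into months; the aggregation must be chosen.
monthly = ts.resample("M").mean()  # "ME" on pandas >= 2.2

# Autocorrelation: how linearly related consecutive points are.
print(ts.autocorr(lag=1))

# Back testing split: training data strictly precedes the test window,
# unlike k-fold cross validation, which would shuffle ordered data.
train, test = ts[:90], ts[90:]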

Study notes: MLops Week 3-3 Computer Vision

·292 words·2 mins
Author
fermion

Week 3-3 of the AWS MLops: Computer vision and AWS Rekognition

Computer Vision and Amazon Rekognition
#

Computer vision
#

  • Automated extraction of information from digital images
  • Applications
    • Public safety and home security
    • Authentication and enhanced computer-human interaction
    • Content management and analysis
    • Autonomous driving
    • Medical imaging
    • Manufacturing process control
  • Computer vision problems:
    • Image analysis
      • Object classification
      • Object detection
      • Object segmentation
    • Video analysis
      • Instance tracking, pathing
      • Action recognition
      • Motion estimation

Amazon Rekognition
#

  • Managed service for image and video analysis

  • Types of analysis:

    • Searchable image and video libraries
    • Face-based user verification
    • Sentiment and demographic analysis
    • Unsafe content detection
  • Can add powerful visual analysis to applications

  • Highly scalable and continuously learns

  • Integrates with other AWS services

  • Examples:

    • Searchable image library

Study notes: MLops Week 3-3 Computer Vision

·292 words·2 mins
Author
fermion
Table of Contents

Week 3-3 of the AWS MLops: Computer vision and AWS Rekognition

Computer Vision and Amazon Rekognition
#

Computer vision
#

  • Automated extraction of information from digital images
  • Applications
    • Public safety and home security
    • Authentication and enhanced computer-human interaction
    • Content management and analysis
    • Autonomous driving
    • Medical imaging
    • Manufacturing process control
  • Computer vision problems:
    • Image analysis
      • Object classification
      • Object detection
      • Object segmentation
    • Video analysis
      • Instance tracking, pathing
      • Action recognition
      • Motion estimation

Amazon Rekognition
#

  • Managed service for image and video analysis

  • Types of analysis:

    • Searchable image and video libraries
    • Face-based user verification
    • Sentiment and demographic analysis
    • Unsafe content detection
  • Can add powerful visual analysis to applications

  • Highly scalable and continuously learns

  • Integrates with other AWS services

  • Examples:

    • Searchable image library
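A minimal boto3 sketch of the image-analysis case, assuming the photo already sits in an S3 bucket; the bucket and key names are placeholders.

import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# Image analysis: detect objects and scenes in a photo stored in S3.
response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "samplebucket", "Name": "photos/cat.jpg"}},
    MaxLabels=5,
)
for label in response["Labels"]:
    print(label["Name"], round(label["Confidence"], 1))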
    diff --git a/posts/2022/mlops-week4/index.html b/posts/2022/mlops-week4/index.html
    index 2c32b51..767f3e4 100644
    --- a/posts/2022/mlops-week4/index.html
    +++ b/posts/2022/mlops-week4/index.html
    @@ -4,7 +4,7 @@
     
    -

Study notes: MLops Week 4

·518 words·3 mins
Author
fermion

Quick notes of AWS MLops, Week 4

Talk about availability, scalability, and resilience
#

  • Monitoring and Logging

    • Viewpoint: data science for software systems
    • AWS CloudWatch
      • Dashboards
      • Search
      • Alerts
      • Automated insights
    • Use CloudWatch to pull in info from servers hosting source code and monitoring agents +

Study notes: MLops Week 4

·518 words·3 mins
Author
fermion
Table of Contents

Quick notes of AWS MLops, Week 4

Talk about availability, scalability, and resilience
#

  • Monitoring and Logging

    • Viewpoint: data science for software systems
    • AWS CloudWatch
      • Dashboards
      • Search
      • Alerts
      • Automated insights
    • Use CloudWatch to pull in info from servers hosting source code and monitoring agents (e.g. CPU, memory, and disk I/O metrics); a sketch follows this list
  • Multiple Regions

    • Resources are distributed across isolated geographic regions which have multiple availability zones
      • Create as much redundant infrastructure as needed
      • Increase resilience
  • Reproducible Workflows

    • Infrastructure as code (IaC) workflow: the idea behind infrastructure as code is that there isn’t a human pushing buttons to make something happen; builds are triggered by events.
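A hedged sketch of publishing a custom metric to CloudWatch from a monitored host; the namespace, metric name, and value are invented.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish one data point; dashboards, searches, and alerts can key off it.
cloudwatch.put_metric_data(
    Namespace="MyApp/Training",  # hypothetical namespace
    MetricData=[{
        "MetricName": "DiskIOUtilization",
        "Value": 42.5,
        "Unit": "Percent",
    }],
)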
    @@ -796,7 +870,7 @@

一週間に三回 (three times a week)

Forecasting: Chap. 1

·534 words·3 mins
Author
fermion

Notes on Chapter 1 of Forecasting: Principles & Practice

What can be forecast
#

Factors that affect the predictability of an event or a quantity

  1. How well we understand the factors that contribute to it;
  2. How much data is available;
  3. How similar the future is to the past;
  4. Whether the forecasts can affect the thing we are trying to forecast.

Example 1: Short-term forecasts of residential electricity demand

  1. Temperature is a primary driving force of the demand, especially in summer.
  2. Historical electricity usage data is available.
  3. It’s safe to assume that the demand behavior will be similar to that in the past. +

Forecasting: Chap. 1

·534 words·3 mins
Author
fermion
Table of Contents

Notes on Chapter 1 of Forecasting: Principles & Practice

What can be forecast
#

Factors that affect the predictability of an event or a quantity

  1. How well we understand the factors that contribute to it;
  2. How much data is available;
  3. How similar the future is to the past;
  4. Whether the forecasts can affect the thing we are trying to forecast.

Example 1: Short-term forecasts of residential electricity demand

  1. Temperature is a primary driving force of the demand, especially in summer.
  2. Historical electricity usage data is available.
  3. It’s safe to assume that the demand behavior will be similar to that in the past. I.e., there is some degree of seasonality.
  4. The price of electricity is not strongly dependent on demand, so the forecast will not have much effect on consumer behavior.

Example 2: Currency exchange rate

  1. We have very limited knowledge about what really contributes to exchange rates.
  2. There is indeed a lot of historical exchange rate data available.
  3. Very difficult to say that the future will be similar to the past. The market is very sensitive to a number of unpredictable factors such as political situation, diff --git a/posts/2024/06-30-forecasting-2/index.html b/posts/2024/06-30-forecasting-2/index.html index 2b51b67..1c2bcdd 100644 --- a/posts/2024/06-30-forecasting-2/index.html +++ b/posts/2024/06-30-forecasting-2/index.html @@ -1,10 +1,10 @@ Forecasting: Chap. 2 · Walking in the woods... -

    Forecasting: Chap. 2

    ·355 words·2 mins
    Author
    fermion

    Notes on Chapter 2 of Forecasting: Principles & Practice

    Time series patterns
    #

    • Trend: +

    Forecasting: Chap. 2

    ·562 words·3 mins
    Author
    fermion
    Table of Contents

    Notes on Chapter 2 of Forecasting: Principles & Practice

    Time series patterns
    #

    • Trend: Trend exists when there is a long-term increase or decrease in the data. The meaning of long or short is relative and depends on the data’s time scale. The trend does not need to be monotonic. It might go from an increasing trend to a decreasing one and vice versa. So long as the time scale of the tendency @@ -29,7 +29,21 @@ this period is not tied to any seasonal factor such as daily, weekly, or yearly and is much longer.

      Fig. 3 International Sunspot number. Credit: Forecasting: SpaceWeatherLive.com

  4. Deciphering seasonality through visualization
    #

    • Seasonal plots: +If the time series data exhibit seasonality, the pattern can be visualized by seasonal plots. These plots are +similar to the usual time series plots except that the data are plotted against the individual seasons, such +as sub-daily, daily, weekly, monthly, yearly, … etc.

    • Seasonal subseries plots: +This is a plot that emphasizes the seasonal patterns by showing the data in separate mini time series plots. +This type of plot is useful in identifying changes within particular seasons.
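A hedged matplotlib/pandas sketch of a seasonal plot on synthetic monthly data: one line per year plotted against month, which is what makes within-season changes visible.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic monthly series covering four years.
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
y = pd.Series(10 + 3 * np.sin(2 * np.pi * idx.month / 12) + np.random.randn(48), index=idx)

# Seasonal plot: month on the x-axis, one line per year.
for year, grp in y.groupby(y.index.year):
    plt.plot(grp.index.month, grp.values, label=str(year))
plt.xlabel("Month")
plt.ylabel("Value")
plt.legend()
plt.show()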

    Scatter plots
    #

    When studying the relationship between two time series (for example: electricity usage and temperature), it is useful +to plot one series against the other using the scatter plot.

    To further quantify the relationship, one can calculate the correlation coefficient to measure the strength of +the linear relationship between the two time series (variables). The correlation coefficient is defined as +$$ +r = \frac{\sum (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum(x_t - \bar{x})^2} \sqrt{\sum(y_t - \bar{y})^2}}. +$$ +By definition, \(-1 \leq r \leq 1\). +Note that \(r\) only measures the strength of linear correlation between two variables. It cannot quantify +higher order or more complex correlations. Therefore, one cannot rely solely on the correlation coefficient when +looking at the relationship between variables.
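A quick numerical check of the formula, assuming two small aligned series; the manual computation should agree with numpy's built-in corrcoef.

import numpy as np

x = np.array([20.0, 22.0, 25.0, 31.0, 33.0])       # e.g. temperature
y = np.array([180.0, 190.0, 220.0, 300.0, 320.0])  # e.g. electricity demand

# r computed exactly as in the formula above.
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
print(r, np.corrcoef(x, y)[0, 1])  # the two values should match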

Forecasting

2024

Forecasting: Chap. 2
·355 words·2 mins
Forecasting: Chap. 1
·534 words·3 mins

© +

Forecasting

2024

Forecasting: Chap. 2
·562 words·3 mins
Forecasting: Chap. 1
·534 words·3 mins

© 2024 fermion

Powered by Hugo & Blowfish

\ No newline at end of file