diff --git a/gretel/gc-nlp_text_analysis/README.md b/gretel/gc-nlp_text_analysis/README.md index b2982365..cb834503 100644 --- a/gretel/gc-nlp_text_analysis/README.md +++ b/gretel/gc-nlp_text_analysis/README.md @@ -1,5 +1,5 @@ -# Work Safely with Sensitive Free Text Using Gretel +# Work Safely with Free Text Using Gretel -Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label chat logs looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that will redact and replace any sensitive strings from chat messages. +Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label a set of email dumps looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that will redact and replace any sensitive strings from the email messages. -At the end of the notebook we'll have a dataset that is safe to share without compromising a user's personal information. +At the end of the notebook we'll have a dataset that is safe to share and analyze without compromising a user's personal information. 
\ No newline at end of file diff --git a/gretel/gc-nlp_text_analysis/blueprint.ipynb b/gretel/gc-nlp_text_analysis/blueprint.ipynb index 4e699263..057d6bfc 100644 --- a/gretel/gc-nlp_text_analysis/blueprint.ipynb +++ b/gretel/gc-nlp_text_analysis/blueprint.ipynb @@ -6,18 +6,18 @@ "metadata": {}, "outputs": [], "source": [ - "!pip install -Uqq spacy gretel-client # we install spacy for their visualization helper, displacy" + "!pip install -Uqq spacy gretel-client datasets # we install spacy for their visualization helper, displacy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Work Safely with Sensitive Free Text Using Gretel\n", + "# Work Safely with Free Text Using Gretel\n", "\n", - "Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-cataloghttps://gretel.ai/platform/data-catalog), we analyze and label chat logs looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that will redact and replace any sensitive strings from chat messages.\n", + "Using Gretel.ai's [NER and NLP features](https://gretel.ai/platform/data-catalog), we analyze and label a set of email dumps looking for PII and other potentially sensitive information. After labeling the dataset, we build a transformation pipeline that will redact and replace any sensitive strings from the email messages.\n", "\n", - "At the end of the notebook we'll have a dataset that is safe to share without compromising a user's personal information." + "At the end of the notebook we'll have a dataset that is safe to share and analyze without compromising a user's personal information." 
] }, { @@ -34,6 +34,7 @@ "outputs": [], "source": [ "import pandas as pd\n", + "import datasets\n", "from gretel_client import get_cloud_client\n", "\n", "pd.set_option('max_colwidth', None)\n", @@ -47,7 +48,7 @@ "metadata": {}, "outputs": [], "source": [ - "client.install_packages()" + "client.install_packages(version=\"dev\")" ] }, { @@ -56,7 +57,7 @@ "source": [ "## Load the dataset\n", "\n", - "For this blueprint, we use a modified dataset from the Ubuntu Chat Corpus. It represents an archived set of IRC logs from Ubuntu's technical support channel. This data primarily contains free form text that we will pass through a NER pipeline for labeling and PII discovery." + "Using Hugging Face's [datasets](https://github.com/huggingface/datasets) library, we load a dataset containing a dump of [Enron emails](https://huggingface.co/datasets/aeslc). This data contains unstructured emails that we will pass through a NER pipeline for labeling and PII discovery." ] }, { @@ -65,7 +66,8 @@ "metadata": {}, "outputs": [], "source": [ - "source_df = pd.read_csv(\"https://gretel-public-website.s3.us-west-2.amazonaws.com/blueprints/nlp_text_analysis/chat_logs_sampled.csv\")" + "source_dataset = datasets.load_dataset(\"aeslc\")\n", + "source_df = pd.DataFrame(source_dataset[\"train\"]).sample(n=300, random_state=99)" ] }, { @@ -146,9 +148,9 @@ "source": [ "from gretel_helpers.spacy import display_entities\n", "\n", - "TEXT_FIELD = \"text\"\n", + "TEXT_FIELD = \"email_body\"\n", "\n", - "for record in project.iter_records(direction=\"backward\", record_limit=100):\n", + "for record in project.iter_records(direction=\"backward\", record_limit=5):\n", " display_entities(record, TEXT_FIELD)" ] }, @@ -228,7 +230,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Inspect the transformed version of the dataset." + "Inspect a transformed email from the dataset." 
] }, { @@ -237,7 +239,20 @@ "metadata": {}, "outputs": [], "source": [ - "xf_df[[TEXT_FIELD]]" + "from gretel_client.demo_helpers import show_record_diff\n", + "\n", + "\n", + "# Look up the comparison email by subject line.\n", + "c_key = \"subject_line\"\n", + "c_value = \"Confidentiality Agreement-Human Code\"\n", + "\n", + "# The comparison email contains multiple lines. For this\n", + "# demonstration we only examine the first line, so we\n", + "# split on newlines and keep the first element.\n", + "orig = source_df[source_df[c_key] == c_value][TEXT_FIELD].iloc[0].split(\"\\n\")[0]\n", + "xf = xf_df[xf_df[c_key] == c_value][TEXT_FIELD].iloc[0].split(\"\\n\")[0]\n", + "\n", + "show_record_diff({\"\": orig}, {\"\": xf})" ] }, {