Centralization of 3W Dataset in BibMon Toolkit: Data Loading and Structuring Functions #132

Open · wants to merge 5 commits into base: main
162 changes: 162 additions & 0 deletions overviews/_baseline/unify-data-tutorial.ipynb
@@ -0,0 +1,162 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3W Dataset's General Presentation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a general presentation of the 3W Dataset, to the best of its authors' knowledge, the first realistic and public dataset with rare undesirable real events in oil wells that can be readily used as a benchmark dataset for development of machine learning techniques related to inherent difficulties of actual data.\n",
"\n",
"For more information about the theory behind this dataset, refer to the paper **A Realistic and Public Dataset with Rare Undesirable Real Events in Oil Wells** published in the **Journal of Petroleum Science and Engineering** (link [here](https://doi.org/10.1016/j.petrol.2019.106223))."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1 Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This Jupyter Notebook presents the 3W Dataset 2.0.0 in a general way. For this, some functionalities for data unification and the benefits of this process are demonstrated.\n",
"\n",
"In complex datasets like 3W, data is often distributed across multiple folders and files, which may hinder quick insights and analysis. The data unification process involves loading, cleaning, and merging these scattered files into a single, well-structured data frame. This process offers several key benefits:\n",
"\n",
"Functionalities of Data Unification\n",
"Automated Loading of Distributed Data:\n",
"\n",
"The notebook loads all Parquet files from multiple folders efficiently.\n",
"It filters out irrelevant files (e.g., simulated data) and extracts important metadata like timestamps directly from file names.\n",
"Data Normalization:\n",
"\n",
"Additional columns (e.g., folder ID, date, and time) are added, ensuring consistency across different data points.\n",
"This enhances downstream analysis by making sure that different segments are harmonized.\n",
"Handling Large-Scale Data with Dask:\n",
"\n",
"The use of Dask allows seamless processing of large datasets that would otherwise not fit into memory.\n",
"This makes it easier to explore and manipulate the entire dataset efficiently.\n",
"Benefits of Data Unification\n",
"Improved Data Accessibility:\n",
"With all data combined into a single structure, researchers and engineers can access relevant information faster, minimizing the time spent searching across files.\n",
"\n",
"Enhanced Analytical Capabilities:\n",
"Unified data allows for richer analytics, such as visualizing trends and patterns across the entire dataset. Anomalies and transient events can be identified and classified more accurately.\n",
"\n",
"Simplified Visualization:\n",
"By consolidating data into a single DataFrame, it's easier to generate comprehensive visualizations that provide meaningful insights about operational states.\n",
"\n",
"Facilitates Collaboration:\n",
"When datasets are standardized and merged, it becomes easier for teams to share their findings and collaborate on data-driven projects. The unified dataset serves as a single source of truth.\n",
"\n",
"This notebook demonstrates these functionalities and benefits by loading the 3W Dataset, classifying events across multiple operational states, and generating visualizations that offer a deeper understanding of system behavior."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Imports and Configurations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"import dask.dataframe as dd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# from toolkit.misc import load_and_combine_data, classify_events, visualize_data\n",
"\n",
"plt.style.use('ggplot') # Estilo do matplotlib\n",
"pd.set_option('display.max_columns', None) # Exibe todas as colunas do DataFrame\n",
"\n",
"dataset_dir = \"C:/Users/anabe/OneDrive/Área de Trabalho/HACKATHON PETROBRÁS/dataset_modificado/dataset_modificado\""
]
},
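{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustration of the unification pattern described in the introduction, the cell below reads the Parquet files of a single folder with Dask and concatenates them into one DataFrame, tagging each partition with its folder ID. It is only a sketch: it assumes the folder layout described in Section 3 (numbered subfolders under `dataset_dir`), and the toolkit function `load_and_combine_data` used later wraps this same idea for all folders at once."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch of the Dask-based unification pattern (illustrative only).\n",
"# Assumes `dataset_dir` contains numbered subfolders with Parquet files, as described in Section 3.\n",
"import glob\n",
"\n",
"example_folder = os.path.join(dataset_dir, \"0\")  # folder 0 = Normal Operation instances\n",
"parquet_files = glob.glob(os.path.join(example_folder, \"*.parquet\"))\n",
"\n",
"if parquet_files:\n",
"    # Each file becomes a lazy Dask DataFrame; concat keeps the data out of core.\n",
"    parts = [dd.read_parquet(f).assign(folder_id=0) for f in parquet_files]\n",
"    combined = dd.concat(parts)\n",
"    print(f\"{len(parquet_files)} files combined; columns: {list(combined.columns)}\")\n",
"else:\n",
"    print(\"No Parquet files found in the example folder.\")"
]
},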
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Instances' Structure"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, we explain the organization of the folders and files within the dataset. The 3W Dataset contains subfolders numbered from 0 to 9, where each folder represents a specific operational situation, as described below:\n",
"\n",
"* 0 = Normal Operation\n",
"* 1 = Abrupt Increase of BSW\n",
"* 2 = Spurious Closure of DHSV\n",
"* 3 = Severe Slugging\n",
"* 4 = Flow Instability\n",
"* 5 = Rapid Productivity Loss\n",
"* 6 = Quick Restriction in PCK\n",
"* 7 = Scaling in PCK\n",
"* 8 = Hydrate in Production Line\n",
"* 9 = Hydrate in Service Line\n",
"\n",
"Each file follows the naming pattern:\n",
"* WELL-00008_20170818000222.parquet\n",
"\n",
"* WELL-00008: Identification of the well.\n",
"* 20170818000222: Timestamp in the format yyyyMMddhhmmss.\n",
"* .parquet: File extension indicating the data format."
]
},
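{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of this naming convention, the next cell splits the example file name above into the well identifier, date, and time. This mirrors the metadata extraction performed later by `load_and_combine_data`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative parsing of the file-naming pattern described above.\n",
"example_file = \"WELL-00008_20170818000222.parquet\"\n",
"\n",
"name_without_ext = os.path.splitext(example_file)[0]  # \"WELL-00008_20170818000222\"\n",
"well_id, timestamp = name_without_ext.split(\"_\")  # \"WELL-00008\", \"20170818000222\"\n",
"\n",
"formatted_date = f\"{timestamp[:4]}-{timestamp[4:6]}-{timestamp[6:8]}\"  # yyyy-MM-dd\n",
"formatted_time = f\"{timestamp[8:10]}:{timestamp[10:12]}:{timestamp[12:]}\"  # hh:mm:ss\n",
"\n",
"print(well_id, formatted_date, formatted_time)  # WELL-00008 2017-08-18 00:02:22"
]
},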
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from toolkit.misc import load_and_combine_data, classify_events, visualize_data\n",
"\n",
"datatype = 'SIMULATED'\n",
"df = load_and_combine_data(dataset_dir, datatype)\n",
"\n",
"if df is not None:\n",
" event_summary = classify_events(df)\n",
"\n",
" visualize_data(event_summary)\n",
"else:\n",
" print(\"Nenhum dado foi carregado.\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
3 changes: 3 additions & 0 deletions toolkit/__init__.py
@@ -107,4 +107,7 @@
load_instances,
resample,
plot_instance,
load_and_combine_data,
classify_events,
visualize_data
)
126 changes: 126 additions & 0 deletions toolkit/misc.py
@@ -9,6 +9,7 @@
import matplotlib.dates as mdates
import matplotlib.colors as mcolors
import os
import dask.dataframe as dd

from matplotlib.patches import Patch
from pathlib import Path
@@ -35,9 +36,134 @@
PARQUET_ENGINE,
)

folder_mapping = {
    0: 'Normal Operation', 1: 'Abrupt Increase of BSW', 2: 'Spurious Closure of DHSV',
    3: 'Severe Slugging', 4: 'Flow Instability', 5: 'Rapid Productivity Loss',
    6: 'Quick Restriction in PCK', 7: 'Scaling in PCK', 8: 'Hydrate in Production Line',
    9: 'Hydrate in Service Line'
}


# Methods
#

def load_and_combine_data(dataset_dir, datatype):
    """
    Loads and combines Parquet files from multiple folders, adding additional columns
    for folder ID, date, and time extracted from the file names.

    Parameters
    ----------
    dataset_dir : str
        Path to the root directory containing subfolders (0 to 9) with Parquet files.

    datatype : str
        File-name prefix identifying the instance type to exclude from the dataset
        (e.g., 'SIMULATED'), so that a specific analysis can be restricted to the
        remaining data.

    Returns
    -------
    dask.dataframe.DataFrame or None
        A combined Dask DataFrame with all the data from the Parquet files, or None
        if no files were found.

    Functionality
    -------------
    - Iterates through folders (0-9) and loads all valid Parquet files, ignoring those
      whose names start with the prefix given in `datatype` (e.g., 'SIMULATED').
    - Extracts date and time from the file name and adds them as new columns ('data', 'hora').
    - Adds a 'folder_id' column to identify the folder each file originated from.

    Example
    -------
    df = load_and_combine_data('/path/to/dataset', 'SIMULATED')
    """
    dfs = []
    for folder in range(10):
        folder_path = os.path.join(dataset_dir, str(folder))
        if os.path.exists(folder_path):
            for file_name in os.listdir(folder_path):
                # Skip files of the type the user chose to exclude
                if file_name.endswith('.parquet') and not file_name.startswith(datatype):
                    df = dd.read_parquet(os.path.join(folder_path, file_name))
                    file_name_without_ext = os.path.splitext(file_name)[0]
                    date_str = file_name_without_ext.split('_')[1]
                    formatted_date = f"{date_str[:4]}-{date_str[4:6]}-{date_str[6:8]}"
                    formatted_time = f"{date_str[8:10]}:{date_str[10:12]}:{date_str[12:]}"
                    df = df.assign(folder_id=folder, data=formatted_date, hora=formatted_time)
                    dfs.append(df)
    return dd.concat(dfs) if dfs else None

def classify_events(df):
    """
    Classifies events in the dataset by folder and event type, and summarizes the
    occurrences of different event types.

    Parameters
    ----------
    df : dask.dataframe.DataFrame
        The DataFrame containing the event data, including a 'folder_id' column and a 'class' column.

    Returns
    -------
    dict
        A dictionary summarizing the count of events by event type ('Normal Operation',
        'Transient', 'Permanent Anomaly') for each folder.

    Functionality
    -------------
    - For each folder (0-9), counts the occurrences of events in three categories:
        - 'Normal Operation': observations labeled 0.
        - 'Transient': observations labeled between 101 and 109 (anomaly code plus 100).
        - 'Permanent Anomaly': observations labeled between 1 and 9.

    Example
    -------
    event_summary = classify_events(df)
    """
    data = {folder_mapping[i]: {'Normal Operation': 0, 'Transient': 0, 'Permanent Anomaly': 0} for i in range(10)}
    for folder in range(10):
        folder_data = df[df['folder_id'] == folder]
        if len(folder_data.index) > 0:
            dtb = folder_data['class'].value_counts().compute()
            data[folder_mapping[folder]]['Normal Operation'] = dtb.get(0, 0)
            data[folder_mapping[folder]]['Transient'] = dtb[(dtb.index >= 101) & (dtb.index <= 109)].sum()
            data[folder_mapping[folder]]['Permanent Anomaly'] = dtb[(dtb.index >= 1) & (dtb.index <= 9)].sum()
    return data

def visualize_data(data):
    """
    Visualizes the event distribution by type using a stacked area chart.

    Parameters
    ----------
    data : dict
        A dictionary where keys are folder names, and values are dictionaries with
        counts of different event types ('Normal Operation', 'Transient', 'Permanent Anomaly').

    Returns
    -------
    None
        Displays a stacked area chart showing the distribution of event types for each folder.

    Functionality
    -------------
    - Converts the input dictionary into a DataFrame for plotting.
    - Generates a stacked area chart with event types represented in different colors.
    - Adds labels for the x and y axes, and a title.

    Example
    -------
    visualize_data(event_summary)
    """
    df_plot = pd.DataFrame(data).T
    df_plot.plot(kind='area', stacked=True, color=['blue', 'orange', 'purple'], figsize=(14, 8), alpha=0.6)
    plt.title('Occurrences by Event Type', fontsize=16)
    plt.xlabel('Situations', fontsize=14)
    plt.ylabel('Amount', fontsize=14)
    plt.xticks(rotation=45, ha='right', fontsize=12)
    plt.legend(title='Event Type', loc='upper right')
    plt.tight_layout()
    plt.show()

def label_and_file_generator(real=True, simulated=False, drawn=False):
"""This is a generating function that returns tuples for all
indicated instance sources (`real`, `simulated` and/or