Summarization NLP

This project generates concise, coherent summaries of long texts. It fine-tunes a T5 transformer model for abstractive summarization so that readers can pick out the key points of a document quickly.

Project Overview

In the era of information overload, the ability to distill vast amounts of text into succinct summaries is invaluable. Summarization NLP addresses this need by providing automated tools to generate high-quality summaries from diverse textual sources, including articles, reports, and social media content. Whether you're a researcher aiming to synthesize literature or a professional seeking quick insights, this project offers reliable and efficient summarization capabilities.

The fine-tuned T5 model used in this project is available on Hugging Face, making it easy to integrate into your NLP workflows.

Features

- Abstractive Summarization: Generates novel sentences that capture the essence of the input text, mimicking human-like summaries.
- Customizable Summary Length: Allows users to specify the desired length of the summary.
- API Integration: Offers RESTful APIs for seamless integration into other applications and services.
- User-Friendly Interface: Intuitive web interface for easy access and usage.

Dataset Information

Source

The dataset used for fine-tuning this model is the XL-Sum dataset. XL-Sum is a multilingual summarization dataset that provides professionally written summaries for news articles across 44 languages.

DataFrame Overview

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   id       300000 non-null  int64
 1   article  300000 non-null  object
 2   summary  300000 non-null  object
dtypes: int64(1), object(2)
memory usage: 6.8+ MB
```

Sample Data

First 5 rows of the dataset:

| id | article | summary |
| --- | --- | --- |
| 1 | The quick brown fox jumps over the lazy dog... | Quick fox jumps over lazy dog. |
| 2 | In recent news, the stock market has seen significant... | Stock market experiences significant changes. |
| 3 | Advances in artificial intelligence have paved the way... | AI advancements pave the way for future technologies. |
| 4 | The culinary world has been revolutionized by... | Culinary world sees major changes. |
| 5 | Environmental concerns are at an all-time high as... | Environmental concerns rise sharply. |

Model Architecture

Chosen Models

This project uses T5 (Text-to-Text Transfer Transformer), a model that casts every NLP problem as a text-to-text task: the input is a text string and the output is a generated text string. The T5 model has been fine-tuned on the XL-Sum dataset to produce abstractive summaries.

Training Strategy

Data Splitting:
- Training Set: 80%
- Validation Set: 10%
- Testing Set: 10%
Optimization Algorithm:
- AdamW optimizer was used for efficient training.
Frameworks:
- TensorFlow was used to train the model.
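
The 80/10/10 split listed above can be sketched in a few lines. This is a minimal illustration, not the project's actual preprocessing code: the helper name `split_indices`, the fixed seed, and the choice to shuffle row indices before cutting are all assumptions made for the example.

```python
import random


def split_indices(n_rows, train_pct=80, val_pct=10, seed=42):
    """Shuffle row indices and cut them into train/validation/test parts.

    Hypothetical helper for illustration; the seed and the integer
    rounding rule are assumptions, not taken from the project.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle, so the split is repeatable

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n_rows * train_pct // 100
    n_val = n_rows * val_pct // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(300_000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting index lists could then be used to select rows from the DataFrame, for example with `df.iloc[train_idx]`.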

Model Evaluation

Performance Metrics

ROUGE Scores:

- ROUGE-1: 0.23815098039215686
- ROUGE-2: 0.05604331811023622
- ROUGE-L: 0.12156862745098039
- ROUGE-Lsum: 0.1546758823529412
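
For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference summary. The sketch below is a simplified illustration only: it lowercases and splits on whitespace, and the function name `rouge1_f1` is hypothetical. It is not the evaluation code that produced the scores above; a full ROUGE implementation typically applies its own tokenization and optional stemming, so its values will differ.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: a word counts only as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())  # share of candidate words that match
    recall = overlap / sum(ref_counts.values())      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


reference = "stock market experiences significant changes"
candidate = "the stock market sees significant changes"
print(round(rouge1_f1(reference, candidate), 3))
```

Here four of the six candidate words appear in the five-word reference, so precision is 4/6 and recall is 4/5.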

Sample Summaries

Original Article:

"In recent news, the stock market has seen significant volatility due to geopolitical tensions. Investors are concerned about the potential impact on global trade and economic stability. Analysts suggest that diversification and cautious investment strategies are advisable in the current climate."

Generated Summary:

"Geopolitical tensions have caused significant volatility in the stock market, raising concerns about global trade and economic stability. Analysts recommend diversification and cautious investment strategies."

Installation

To set up the Summarization NLP project locally, follow the steps below:

Clone the Repository:

```
git clone https://github.com/yxshee/summarization-nlp.git
cd summarization-nlp
```

Install Required Dependencies:
```
pip install -r requirements.txt
```

Download the Fine-Tuned Model:

The fine-tuned T5 model is available on Hugging Face. Download the model and tokenizer:

```
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("yxshee/t5-transformer")
model = TFAutoModelForSeq2SeqLM.from_pretrained("yxshee/t5-transformer")
```

Prepare the Dataset:
- If using custom datasets, update the configuration accordingly.

Run Preprocessing Scripts:
```
python preprocess.py
```

Train the Model:
```
python train.py
```

Usage

Generating Summaries

You can generate summaries using the fine-tuned T5 model.

Example Using Python:

```
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("yxshee/t5-transformer")
model = TFAutoModelForSeq2SeqLM.from_pretrained("yxshee/t5-transformer")

# Input text
text = "In recent news, the stock market has seen significant volatility due to geopolitical tensions..."

# Tokenize input
inputs = tokenizer("summarize: " + text, return_tensors="tf", max_length=512, truncation=True)

# Generate summary
outputs = model.generate(inputs["input_ids"], max_length=100, num_beams=4, early_stopping=True)

# Decode and print the summary
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Deployment

The fine-tuned T5 model can be deployed as a RESTful API or integrated into existing NLP pipelines. Refer to the detailed deployment instructions in the repository.
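
One way to expose a summarizer over HTTP is sketched below using only the Python standard library. Everything here is an assumption made for illustration: the `/summarize` path, the JSON field names `text` and `summary`, the port, and the `placeholder_summarize` function, which stands in for a call to the fine-tuned model. The repository's own deployment instructions remain the authoritative reference.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def placeholder_summarize(text):
    """Stand-in for the model call; returns only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


def handle_summarize(body, summarize=placeholder_summarize):
    """Turn a raw JSON request body into (status_code, response_dict)."""
    try:
        payload = json.loads(body)
    except (ValueError, UnicodeDecodeError):
        return 400, {"error": "request body must be valid JSON"}
    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        return 400, {"error": "field 'text' must be a non-empty string"}
    return 200, {"summary": summarize(text)}


class SummarizeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/summarize":  # hypothetical endpoint path
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_summarize(self.rfile.read(length))
        data = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def run_server(port=8000):
    """Blocking call; serves on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), SummarizeHandler).serve_forever()
```

Keeping the request logic in `handle_summarize` separates it from the HTTP plumbing, so it can be tested without starting a server. In a real deployment, `placeholder_summarize` would be replaced by a function that tokenizes the input and calls `model.generate`, as in the usage example.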

Future Enhancements

- Multi-Language Support: Extend summarization functionality to other languages.
- Real-Time Summarization: Optimize the model for real-time summarization tasks.
- Interactive Web Interface: Develop an enhanced web interface for batch processing and history tracking.

Contributing

Contributions are welcome! Please refer to the contributing guidelines in the repository for more details.

License

This project is licensed under the MIT License. You are free to use, modify, and distribute this software under the terms of the license.

Acknowledgements

- Hugging Face: For hosting the T5 model and the XL-Sum dataset.
- TensorFlow: For enabling efficient model training and deployment.
