This project focuses on building a robust keyword recognition system using the Speech Commands Dataset v2. The dataset consists of one-second audio files containing spoken English words, enabling the training of machine learning models for real-time keyword detection. The system aims to:
- Accurately recognize command words from short audio clips.
- Perform robustly in noisy environments.
- Be customizable with additional user-specific data for fine-tuning.
The Speech Commands Dataset v2 is a collection of 105,829 one-second audio samples, each containing a single spoken English word, organized into 35 word categories. Key characteristics include:
- Core Commands: Yes, No, Up, Down, Left, Right, On, Off, Stop, Go.
- Auxiliary Words: Cat, Dog, Bird, Tree, and other filler words.
- Noise Samples: Background-noise recordings (white noise, pink noise, running water, etc.) kept in a separate _background_noise_ folder.
- Diversity: Recordings from speakers of varied ages, genders, and accents.
- Partitioning (the official split, reproduced in the sketch below):
  - Training Data: 80% of the dataset.
  - Validation Data: 10%.
  - Test Data: 10%.
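The 80/10/10 split follows the official partition shipped with the dataset, which lists the held-out files in validation_list.txt and testing_list.txt. The sketch below is a minimal illustration of reproducing that split; the dataset path and relative file paths are placeholders.
import os

def load_official_split(dataset_dir):
    # Read the validation/testing file lists distributed with the dataset
    def read_list(name):
        with open(os.path.join(dataset_dir, name)) as f:
            return set(line.strip() for line in f if line.strip())
    return read_list('validation_list.txt'), read_list('testing_list.txt')

def assign_partition(relative_path, validation_files, testing_files):
    # relative_path looks like 'yes/<recording>.wav'
    if relative_path in validation_files:
        return 'validation'
    if relative_path in testing_files:
        return 'testing'
    return 'training'  # everything not listed belongs to the ~80% training split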
To prepare the audio data for machine learning, several preprocessing steps are applied:
- Audio Normalization: Ensures uniform loudness across recordings.
- Spectrogram Conversion: Converts the raw audio into Mel-spectrograms that convolutional neural networks (CNNs) can analyze.
- Noise Augmentation: Adds background noise to training samples to enhance model robustness in real-world environments.
The code below loads the dataset and performs the basic spectrogram conversion; sketches of the normalization, Mel scaling, and noise mixing steps follow it.
import tensorflow as tf
import numpy as np
import os
# List the command-word classes in the dataset directory (the path below is a placeholder)
commands_dir = 'path_to_speech_commands_dataset'
commands = np.array(tf.io.gfile.listdir(commands_dir))
commands = commands[commands != '_background_noise_']
print(f'Commands: {commands}')
# Preprocessing: decode a WAV file and convert it to a magnitude spectrogram
def preprocess_audio(file_path):
    # Decode the WAV file into a mono float32 waveform in [-1, 1]
    audio = tf.io.read_file(file_path)
    audio, _ = tf.audio.decode_wav(audio, desired_channels=1)
    audio = tf.squeeze(audio, axis=-1)
    # Zero-pad shorter clips to one second (16,000 samples at 16 kHz)
    audio = audio[:16000]
    zero_padding = tf.zeros([16000] - tf.shape(audio), dtype=tf.float32)
    audio = tf.concat([audio, zero_padding], axis=0)
    # Short-time Fourier transform; its magnitude is the spectrogram
    stft = tf.signal.stft(audio, frame_length=255, frame_step=128)
    spectrogram = tf.abs(stft)
    return spectrogram
# Example Usage
example_file = os.path.join(commands_dir, 'yes', 'sample_audio.wav')
spectrogram = preprocess_audio(example_file)
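The preprocessing list above also calls for loudness normalization, Mel-scale conversion, and noise augmentation, which preprocess_audio does not yet perform. The helpers below (reusing the imports above) are one possible way to add them; the Mel parameters, the 0.1 noise level, and the noise_clips list of one-second background-noise tensors are illustrative assumptions, not values fixed by the project.
# Peak-normalize a waveform so all clips have comparable loudness
def normalize_audio(audio):
    return audio / (tf.reduce_max(tf.abs(audio)) + 1e-9)

# Project a linear-frequency spectrogram onto the Mel scale and log-compress it
def to_mel_spectrogram(spectrogram, sample_rate=16000, num_mel_bins=64):
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins,
        num_spectrogram_bins=spectrogram.shape[-1],
        sample_rate=sample_rate,
        lower_edge_hertz=80.0,
        upper_edge_hertz=7600.0)
    mel = tf.matmul(spectrogram, mel_matrix)
    return tf.math.log(mel + 1e-6)

# Mix a randomly chosen background-noise clip into a training waveform
def add_background_noise(audio, noise_clips, noise_level=0.1):
    noise = noise_clips[np.random.randint(len(noise_clips))]
    return audio + noise_level * noise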
The system uses a Convolutional Neural Network (CNN), which excels at processing spectrograms. The architecture is designed to recognize local time-frequency patterns in the audio data.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
def create_cnn_model(input_shape):
    # Two convolution/pooling stages followed by a dense classification head
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        MaxPooling2D(2, 2),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D(2, 2),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.5),                    # regularization against overfitting
        Dense(35, activation='softmax')  # 35 classes, one per command word
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Example input shape; spectrograms must be resized to 64x64 and given a channel dimension to match
input_shape = (64, 64, 1)
model = create_cnn_model(input_shape)
model.summary()
The model was trained with the following configuration (a training sketch follows this list):
- Optimizer: Adam.
- Loss Function: Categorical Crossentropy.
- Learning Rate: 0.001.
- Batch Size: 64.
- Epochs: 30.
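A training run matching these settings might look like the sketch below; train_ds and val_ds are hypothetical tf.data pipelines yielding batches of 64 (spectrogram, one-hot label) pairs, and 0.001 is simply Adam's learning rate made explicit.
# The model built above is re-compiled here only to make the learning rate explicit
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# train_ds / val_ds: hypothetical tf.data pipelines already batched at 64
history = model.fit(train_ds, validation_data=val_ds, epochs=30)
The history object returned by fit is what the plotting code further below reads its accuracy and loss curves from.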
With these settings, the model achieved:
- Validation Accuracy: 86%
- Test Accuracy: 85.65%
- Loss on Test Set: 0.629
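The test-set figures above correspond to an evaluation call such as the following, where test_ds is a placeholder pipeline built from the held-out 10% split.
# Evaluate on the held-out test split (test_ds is a hypothetical tf.data pipeline)
test_loss, test_accuracy = model.evaluate(test_ds)
print(f'Test accuracy: {test_accuracy:.4f}, test loss: {test_loss:.4f}')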
import matplotlib.pyplot as plt
# Plot training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.title('Accuracy Over Epochs')
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.title('Loss Over Epochs')
plt.show()
The confusion matrix highlights which command words are frequently misclassified. For instance, words like “No” and “Go” were occasionally confused due to phonetic similarities.
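A confusion matrix of this kind can be computed from the test predictions as sketched below, using the same placeholder test_ds and the commands array from earlier as the class ordering.
# Gather true labels and predictions from the (hypothetical) test pipeline
y_true = np.concatenate([np.argmax(labels, axis=1) for _, labels in test_ds])
y_pred = np.argmax(model.predict(test_ds), axis=1)

# Row i, column j counts clips of word i that were classified as word j
conf_matrix = tf.math.confusion_matrix(y_true, y_pred, num_classes=35).numpy()

plt.figure(figsize=(8, 8))
plt.imshow(conf_matrix, cmap='Blues')
plt.xticks(range(35), commands, rotation=90)
plt.yticks(range(35), commands)
plt.xlabel('Predicted word')
plt.ylabel('True word')
plt.title('Confusion Matrix')
plt.show()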
To improve performance for specific users, the model can be fine-tuned with additional recordings (see the sketch after this list). Custom datasets (e.g., 30 samples per word) were used to:
- Enhance personalization.
- Improve accuracy in specific environments.
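A minimal fine-tuning sketch under these assumptions follows; custom_ds is a hypothetical tf.data pipeline built from the user's recordings, and the frozen-layer split, reduced learning rate, and epoch count are illustrative choices rather than settings reported above.
# Freeze the convolutional feature extractor; retrain only the dense head
for layer in model.layers[:-3]:
    layer.trainable = False

# Re-compile with a smaller learning rate so the pretrained weights shift gently
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# custom_ds: hypothetical pipeline of ~30 user-recorded clips per word
fine_tune_history = model.fit(custom_ds, epochs=10)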
Two practical challenges came up along the way:
- Overfitting: Addressed using dropout and noise augmentation.
- Generalization: Maintained by balancing original and custom data.
Future work includes:
- Experimenting with advanced architectures such as recurrent neural networks (RNNs) or transformers.
- Deploying the model for real-time inference on mobile or edge devices.
- Enhancing noise augmentation techniques to improve performance in challenging environments.