Role of Machine Learning and Computation in MEMENTO
by Vaibhav Jain

Exploring Urban Data with Machine Learning

by Jonathan E. Stiles in Spring 2025

Computation Design Practices 2024-2025

Graduate School of Architecture, Planning and Preservation, Columbia University

MEMENTO Machine Learning QR
MEMENTO Platform QR
MEMENTO Project Documentation QR

1. Overview

What is MEMENTO?

MEMENTO Cover

MEMENTO is a real-time platform that captures urban experiences across the city, aiming to keep people from drifting into oblivion (being unaware of what is happening around them) and instead to transform those moments into engaging odysseys (long, eventful, adventurous journeys).

Unlike conventional platforms, MEMENTO invites users to look beyond the map — to notice, witness, interact, engage, record, react, and reflect on urban experiences as physical mementos, rather than merely experiencing them through virtual screens.

It empowers users to discover, capture, and engage with the mementos around them during their commutes, turning everyday journeys into moments of exploration, creation, and interaction.

MEMENTO serves as an interaction, intersection, and interplay between the city, its people, and their experiences — a platform:
By the people and the city,
For the people and the city,
Of the people and the city.

Overview of the Computation Tools, Methods and Machine Learning Pipeline in MEMENTO

MEMENTO - the platform

The Machine Learning pipeline for public mementos in MEMENTO is designed to automate the classification and tagging of public content sourced from platforms such as Secret NYC, Reddit, NYC Bucket List, Newsbreak, and more. The objective is to replace rule-based keyword matching with a robust multi-label classification model that leverages existing user-generated mementos as training data.

The pipeline consists of three primary models — Category Classification, Tag Prediction, and Duration Estimation — each trained using supervised learning techniques. Logistic Regression with a One-vs-Rest strategy and Random Forest classifiers are employed to predict relevant categories and multiple tags, while Decision Trees handle ordinal duration estimation. Text data undergoes preprocessing, including tokenization, stop-word removal, and TF-IDF vectorization, to convert descriptions into feature vectors suitable for model training.
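To make this concrete, here is a minimal sketch of the multi-label tagging approach with TF-IDF features and a One-vs-Rest Logistic Regression in scikit-learn; the example texts, tags, and the 0.3 cutoff are illustrative placeholders rather than MEMENTO's actual training data.

# Minimal multi-label sketch: TF-IDF features + One-vs-Rest Logistic
# Regression. Texts, tags, and the 0.3 threshold are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "hidden rooftop garden with skyline views",
    "late night jazz set in a basement bar",
    "pop-up food market under the bridge",
]
tag_sets = [["unmapped", "ephemeral"], ["ephemeral"], ["unmapped"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tag_sets)              # one binary column per tag

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)          # TF-IDF feature vectors

# One-vs-Rest fits an independent binary classifier per tag
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

probs = clf.predict_proba(vectorizer.transform(["secret rooftop garden tour"]))
predicted = [t for t, p in zip(mlb.classes_, probs[0]) if p >= 0.3]
print(predicted)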

By implementing this ML pipeline, MEMENTO enhances the consistency and accuracy of content classification, enabling the platform to transform unstructured public content into structured, contextually enriched mementos that align with its existing taxonomy.

2. Introduction

MEMENTO is not just a platform — it's a computational ecosystem that leverages a wide spectrum of tools, methods, and data-driven processes to transform everyday urban experiences into interactive, real-time narratives. It integrates multiple computational layers, each playing a distinct role in capturing, analyzing, and visualizing the city's overlooked moments.

Computational Framework

MEMENTO's computational framework is divided into two core sections:

🌐 Front End

The user-facing web-app interface that brings urban experiences to life through:

  • 🖥️ Web-App Creation: User profiles, memento capture forms, and dynamic content rendering.
  • 🗺️ Geospatial Mapping: Mapping mementos in real time using Mapbox and Google Maps API.
  • 📊 Data Visualization: Creating interactive, data-rich maps using D3.js, highlighting patterns and clusters.
  • 🔍 Interactive Mapping: User-controlled filters, radius selectors, and category-based memento discovery.
  • 👤 Explorer Profile Creation: User profiles that evolve through collected mementos, creating personalized urban journeys.
  • 🔄 User Interaction & Engagement: Filters, recommendations, and curated lists driven by user behavior.

⚙️ Back End

The computational backbone that processes, stores, and structures data using:

  • 💾 Firebase Cloud Storage: Real-time data storage for media, text, and location data.
  • 🧠 Machine Learning Models: Predictive models that recommend mementos based on user behavior and sentiment analysis.
  • 📝 Data Structuring & Input Mapping: Categorizing mementos by type, tag, duration — transforming raw inputs into structured data.
  • ⚡ Memento Analysis & Assignment: Algorithms assign memento categories, tags, and durations based on user inputs and contextual data.
  • 📍 Google Maps API: Geolocation data is layered onto dynamic maps, visualizing where experiences occur and how they're clustered.
  • 🌐 Public Data Scraping: Integrating public datasets from online platforms to supplement user-generated mementos with real-time urban events.

Together, these front-end and back-end components create a cohesive computational ecosystem that transforms urban experiences into interactive, real-time narratives, making the city's overlooked moments discoverable and engaging.

3. Role of Computation in MEMENTO

Computational Workflow

MEMENTO's workflow is structured around the creation, visualization, exploration, and curation of urban experiences, transforming scattered moments into structured datasets. The process includes both user-generated and public mementos, creating a dynamic, interactive map of the city's fleeting encounters.

MEMENTO Core actions

A comprehensive visualization of the MEMENTO platform's computational workflow, illustrating the flow of memento datasets from generation to visualization to exploration, forming the core interaction model of the platform.

1. Creation of User Mementos

Users upload media, add text reflections, and tag locations, creating structured mementos categorized by type, tag, and duration.
  • Media, text, and geolocation inputs.
  • Categorized using predefined lists (categories, tags, duration).
  • Data stored in Firebase as structured memento entries (see the sketch below).
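A minimal sketch of what storing one such entry could look like with the Firebase Admin SDK; the collection name "mementos" and the field values are illustrative assumptions rather than the platform's exact schema.

# Hypothetical sketch of writing one structured memento entry to Firestore.
# Collection name "mementos" and field values are illustrative assumptions;
# assumes a valid credentials file.
import firebase_admin
from firebase_admin import credentials, firestore

cred = credentials.Certificate("firebase_credentials.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

memento = {
    "name": "Pop-up mural on Canal St",
    "description": "Fresh street art that appeared overnight.",
    "location": {"lat": 40.7182, "lng": -74.0021},
    "category": "Art",                      # from the predefined category list
    "tags": ["ephemeral", "unmapped"],      # from the predefined tag list
    "duration": "15min",                    # from the predefined duration list
    "mementoType": "user",
    "timestamp": firestore.SERVER_TIMESTAMP,
}
db.collection("mementos").add(memento)      # auto-generated document ID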

2. Creation of Public Mementos

Public mementos are generated through data scraping and machine learning, integrating citywide events and activities as real-time mementos.
  • Data sourced from public platforms.
  • Machine learning processes data in real time.
  • Structured to align with MEMENTO format.

3. Mementos on MEMENTO

All mementos — user and public — are populated on a real-time interactive map, each tagged with location, media, timestamp, and description.
  • Data visualized on a dynamic map.
  • Displays geolocation, media, and descriptive tags.
  • Serves as a playground of real-time urban experiences.

4. Exploration of Mementos

Users explore mementos using filters and settings, discovering experiences by category, tag, duration, and proximity.
  • Filter by categories, tags, duration, and distance.
  • Explore mementos through curated lists and live feed.
  • Discover mementos based on user's current location.

5. Curation of Mementos

Recommendations are generated based on user profiles, interaction history, and data analysis, creating personalized memento lists.
  • Personalized memento curation.
  • Daily, trending, recommended, and nearby mementos.
  • Data-driven recommendations based on user behavior.

4. Machine Learning Methodology for Public Mementos Generation

The MEMENTO ML pipeline is designed as a modular, five-step system that transforms raw public content into well-categorized, ML-enhanced mementos. Each step in the pipeline serves a specific purpose and contributes to the overall goal of automated content classification and enrichment.

  1. 🔄 Data Processing: Transform raw user mementos into standardized datasets
  2. ⚙️ Data Preparation: Prepare data for model training
  3. 🧠 Model Training: Train specialized classification models
  4. 🌐 Data Scraping: Collect and process public content
  5. ✨ Processing & Classification: Apply models to classify content

🔄 Step 1: Data Processing

The initial step focuses on transforming raw user mementos into a standardized dataset suitable for machine learning. This step includes:

📥 1. Data Collection and Standardization

  • Reading raw memento data from JSON files
  • Standardizing field names and data formats
  • Handling missing values and data inconsistencies
  • Implementing data validation checks

🧬 2. Feature Extraction

  • Processing text fields (title, description, location)
  • Extracting temporal information
  • Standardizing date formats
  • Handling special characters and formatting

✅ 3. Quality Control

  • Validating data integrity
  • Checking for required fields
  • Ensuring consistent data types
  • Generating processing statistics

⚙️ Step 2: Data Preparation

This step prepares the processed data from Step 1 for model training, covering data splitting, feature engineering, and text preprocessing, as sketched in code after the list below.

🪓 1. Data Splitting

  • Creating training and validation sets
  • Maintaining balanced category distribution
  • Preserving data integrity
  • Implementing stratified sampling

🛠️ 2. Feature Engineering

  • Creating text feature matrices
  • Implementing TF-IDF vectorization
  • Handling categorical variables
  • Preparing label encoders

🧹 3. Data Preprocessing

  • Text cleaning and normalization
  • Feature scaling
  • Handling missing values
  • Preparing multi-label formats
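A compact sketch of these preparation steps on toy data; the column names follow Step 1's output, but the rows themselves are illustrative.

# Sketch of the preparation steps above on toy data; column names follow
# Step 1's output, the rows are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "text_for_ml": ["hidden rooftop garden", "late night jazz bar",
                    "pop up food market", "street mural tour",
                    "secret ramen counter", "community garden walk"],
    "category": ["urban-nature", "culture", "food",
                 "culture", "food", "urban-nature"],
    "tags": [["unmapped"], ["ephemeral"], ["ephemeral", "unmapped"],
             ["unmapped"], ["ephemeral"], ["unmapped"]],
})

# Stratified split preserves the category distribution in both sets
train_df, val_df = train_test_split(
    df, test_size=0.5, stratify=df["category"], random_state=42)

# TF-IDF feature matrices (fit on training text only)
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_df["text_for_ml"])
X_val = vectorizer.transform(val_df["text_for_ml"])

# Multi-label format: one binary indicator column per tag
mlb = MultiLabelBinarizer()
y_tags_train = mlb.fit_transform(train_df["tags"])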

🧠 Step 3: Model Training

The pipeline implements three specialized models for different classification tasks, sketched in code after the list:

🌲 1. Category Classification Model

  • Type: Random Forest Classifier
  • Features: Title, description, location
  • Output: Category predictions with confidence scores
  • Parameters: n_estimators=100, max_depth=10

🏷️ 2. Tags Prediction Model

  • Type: Multi-label Classification
  • Features: Title, description, location
  • Output: Multiple tag predictions
  • Parameters: threshold=0.3

⏳ 3. Duration Estimation Model

  • Type: Decision Tree Classifier
  • Features: Title, description, location
  • Output: Duration predictions
  • Parameters: max_depth=5
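A wiring sketch for the three classifiers with the parameters listed above; the feature matrix X and the y_* labels are assumed to come from the preparation step, and the One-vs-Rest Logistic Regression base estimator follows the pipeline overview.

# Wiring sketch for the three models with the parameters listed above;
# X and the y_* labels are assumed outputs of the preparation step.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

category_model = RandomForestClassifier(n_estimators=100, max_depth=10)
tags_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
duration_model = DecisionTreeClassifier(max_depth=5)

# category_model.fit(X, y_category)
# tags_model.fit(X, y_tags)
# duration_model.fit(X, y_duration)
#
# Tags are kept only when predicted probability clears the 0.3 threshold:
# tag_probs = tags_model.predict_proba(X_new)[0]
# tags = [t for t, p in zip(tag_names, tag_probs) if p >= 0.3]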

🌐 Step 4: Data Scraping

This step collects and processes public content:

🌐 1. Web Scraping

  • Target: Secret NYC website
  • Content types: Articles, events, locations
  • Extraction methods: BeautifulSoup4
  • Rate limiting and error handling

📝 2. Content Processing

  • Text extraction and cleaning
  • Date parsing and standardization
  • Location extraction
  • Duration estimation

🔍 3. Data Validation

  • Content quality checks
  • Required field validation
  • Format standardization
  • Error logging

✨ Step 5: Data Processing and Classification

The final step applies the trained models to classify scraped content:

🤖 1. Model Application

  • Loading trained models
  • Text preprocessing
  • Feature extraction
  • Prediction generation

🛡️ 2. Quality Control

  • Confidence threshold checks
  • Prediction validation
  • Error handling
  • Report generation

📦 3. Output Generation

  • Creating processed mementos
  • Adding predictions and confidence scores
  • Generating processing reports
  • Saving results in multiple formats

Detailed ML Pipeline Breakdown

Installation and Setup

Overview

This section outlines the installation and setup requirements for running the MEMENTO ML pipeline. Follow these steps to set up your development environment and install all necessary dependencies.

System Requirements

  • Python 3.8 or higher
  • pip (Python package installer)
  • Git (for version control)
  • At least 4GB RAM (8GB recommended)
  • Firebase account and credentials

Installation Steps

# 1. Clone the repository
git clone https://github.com/0209vaibhav/Machine-Learning_MEMENTO.git
cd Machine-Learning_MEMENTO

# 2. Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install required packages
pip install -r requirements.txt

# 4. Set up Firebase credentials
# Place your Firebase credentials JSON file in:
# ml_pipeline/input/step1_data_loader/firebase_credentials.json

Required Packages

# Core ML and Data Processing
scikit-learn>=1.0.2
pandas>=1.3.0
numpy>=1.21.0
nltk>=3.6.0

# Web Scraping and Data Collection
beautifulsoup4>=4.9.3
requests>=2.26.0
selenium>=4.0.0

# Firebase Integration
firebase-admin>=5.0.0

# Text Processing
spacy>=3.1.0
langdetect>=1.0.9

# Utilities
python-dotenv>=0.19.0
tqdm>=4.62.0
joblib>=1.0.0

Configuration Files

The pipeline requires the following configuration files to be set up:

  • Firebase Credentials:
    {
        "type": "service_account",
        "project_id": "your-project-id",
        "private_key_id": "your-key-id",
        "private_key": "your-private-key",
        "client_email": "your-client-email",
        "client_id": "your-client-id",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "your-cert-url"
    }
  • Environment Variables:
    # .env file
    FIREBASE_CREDENTIALS_PATH=ml_pipeline/input/step1_data_loader/firebase_credentials.json
    MODEL_OUTPUT_DIR=ml_pipeline/output
    SCRAPING_INTERVAL=3600  # in seconds
    CONFIDENCE_THRESHOLD=0.4
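A small sketch of how these values could be read at runtime with python-dotenv (already listed in the requirements); the variable names match the .env file above, while the fallback defaults are assumptions.

# Sketch of reading the .env configuration above with python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads variables from the .env file into the environment

credentials_path = os.getenv("FIREBASE_CREDENTIALS_PATH")
output_dir = os.getenv("MODEL_OUTPUT_DIR", "ml_pipeline/output")
scraping_interval = int(os.getenv("SCRAPING_INTERVAL", "3600"))
confidence_threshold = float(os.getenv("CONFIDENCE_THRESHOLD", "0.4"))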

Verification Steps

After installation, verify your setup by running:

# 1. Verify Python version
python --version  # Should be 3.8 or higher

# 2. Verify package installation
pip list  # Should show all required packages

# 3. Verify Firebase connection
python -c "from firebase_admin import credentials, initialize_app; initialize_app(credentials.Certificate('ml_pipeline/input/step1_data_loader/firebase_credentials.json'))"

# 4. Run test script
python ml_pipeline/tests/test_setup.py

Common Issues and Solutions

  • Firebase Connection Issues:
    • Verify credentials file path and format
    • Check internet connection
    • Ensure Firebase project is active
  • Package Installation Issues:
    • Update pip: python -m pip install --upgrade pip
    • Clear pip cache: pip cache purge
    • Install packages individually if needed
  • Memory Issues:
    • Reduce batch size in model training
    • Use smaller model architectures
    • Enable garbage collection

Step 1: Data Loading and Preparation

Overview

This step is responsible for loading user mementos from Firebase and preparing them for ML training. It serves as the foundation for our machine learning pipeline by transforming raw user data into a structured format suitable for model training.

Purpose

The main objectives of this step are:

  • Load user mementos from Firebase database
  • Clean and normalize text data
  • Extract relevant features for ML training
  • Prepare data in a format suitable for model training

Implementation Details

The implementation uses the following key components:

  • MementoDataLoader Class: Main class handling data loading and processing
  • Text Preprocessing: Using NLTK for text cleaning and normalization
  • Feature Extraction: Combining multiple text fields for better ML features
  • Data Validation: Handling missing values and data type conversions

Inputs

The step requires the following inputs:

  • Firebase credentials (JSON file)
    {
        "type": "service_account",
        "project_id": "your-project-id",
        "private_key_id": "your-key-id",
        "private_key": "your-private-key",
        "client_email": "your-client-email",
        "client_id": "your-client-id",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "your-cert-url"
    }
  • Categories metadata (JSON file)
    [
      {
        "id": "architecture",
        "name": "Architecture",
        "symbol": "🏛️",
        "keywords": ["building", "structure", "design", "architecture", "landmark", "historic", "monument", "architectural"]
      },
      {
        "id": "urban-nature",
        "name": "Urban Nature",
        "symbol": "🌿",
        "keywords": ["park", "garden", "nature", "green space", "plants", "trees", "urban nature", "outdoor"]
      },
      // ... more categories ...
    ]
  • Tags metadata (JSON file)
    [
      {
        "id": "ephemeral",
        "name": "Ephemeral",
        "symbol": "🌀",
        "keywords": ["temporary", "fleeting", "short-lived", "momentary", "brief", "passing", "transient", "ephemeral"]
      },
      {
        "id": "unmapped",
        "name": "Unmapped",
        "symbol": "📍",
        "keywords": ["hidden", "undiscovered", "secret", "unknown", "unexplored", "off-map", "unmapped", "new"]
      },
      // ... more tags ...
    ]
  • Durations metadata (JSON file)
    [
      {
        "id": "15min",
        "name": "15 Minutes",
        "symbol": "⚡",
        "keywords": ["15 minutes", "quarter hour", "quick stop", "brief moment", "passing", "fleeting"]
      },
      {
        "id": "1hr",
        "name": "1 Hour",
        "symbol": "⏱️",
        "keywords": ["1 hour", "one hour", "hour long", "60 minutes", "quick visit"]
      },
      // ... more durations ...
    ]

Outputs

The step produces the following outputs:

  • Processed training data (user_mementos_processed.csv) containing the cleaned mementos and a combined text_for_ml field

Complete Implementation

"""
Step 1: Data Loading and Preparation

This module is responsible for loading user mementos from Firebase for ML training.
These user mementos will be used as training data for the ML models.
"""

import json
import os
import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple
import logging
from sklearn.preprocessing import MultiLabelBinarizer
import firebase_admin
from firebase_admin import credentials, firestore
from datetime import datetime
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# NLTK corpora used below ('punkt', 'stopwords', 'wordnet') must be
# downloaded once, e.g. nltk.download('stopwords')

class MementoDataLoader:
    def __init__(self, 
                 categories_path: str = None,
                 tags_path: str = None,
                 durations_path: str = None,
                 firebase_credentials_path: str = None):
        """Initialize the data loader"""
        self.categories = {}
        self.tags = {}
        self.durations = {}
        self.db = None
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
        
        # Load metadata if paths provided
        if categories_path:
            self.load_categories(categories_path)
        if tags_path:
            self.load_tags(tags_path)
        if durations_path:
            self.load_durations(durations_path)
            
        # Initialize Firebase if credentials provided
        if firebase_credentials_path:
            self.firebase_credentials_path = firebase_credentials_path
            self._initialize_firebase()

    def _preprocess_text(self, text: str) -> str:
        """Preprocess text for ML"""
        if not isinstance(text, str):
            text = str(text)
        text = text.lower()
        text = re.sub(r'[^a-z\s]', ' ', text)
        text = ' '.join(text.split())
        return text

    def _extract_text_for_ml(self, memento: Dict) -> str:
        """Extract and preprocess text for ML from memento"""
        text_parts = []
        name = memento.get('name', '')
        # Repeat the title three times so it carries extra weight in TF-IDF
        text_parts.extend([name] * 3)
        description = memento.get('description', '')
        text_parts.append(description)
        location = memento.get('location', {})
        if isinstance(location, dict):
            # Include the raw location dict as text so place names contribute features
            location_str = str(location)
            text_parts.append(location_str)
        combined_text = ' '.join(filter(None, text_parts))
        return self._preprocess_text(combined_text)

    def prepare_training_data(self, df: pd.DataFrame, output_dir: str) -> Dict[str, str]:
        """Prepare data for training"""
        os.makedirs(output_dir, exist_ok=True)
        processed_data = []
        for _, memento in df.iterrows():
            memento_dict = memento.to_dict()
            text_for_ml = self._extract_text_for_ml(memento_dict)
            memento_dict['text_for_ml'] = text_for_ml
            processed_data.append(memento_dict)
        
        processed_df = pd.DataFrame(processed_data)
        output_path = os.path.join(output_dir, 'user_mementos_processed.csv')
        processed_df.to_csv(output_path, index=False)
        return {'processed_data': output_path}
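A minimal usage sketch against the excerpt above; the single-row DataFrame stands in for mementos fetched from Firebase, and NLTK's stopwords corpus must be downloaded first (nltk.download('stopwords')).

# Minimal usage sketch of MementoDataLoader with placeholder data.
import pandas as pd

loader = MementoDataLoader()  # metadata/Firebase paths omitted for brevity
df = pd.DataFrame([{
    "name": "Hidden rooftop garden",
    "description": "A quiet green space above the noise of Midtown.",
    "location": {"lat": 40.7549, "lng": -73.9840},
}])
outputs = loader.prepare_training_data(df, "ml_pipeline/output/step1_data_processing")
print(outputs)  # {'processed_data': '.../user_mementos_processed.csv'}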

Step 2: Model Architecture Definition

Overview

This step defines the machine learning model architectures used for classifying mementos. It sets up three specialized models for different aspects of memento classification: categories, tags, and duration estimation.

Purpose

The main objectives of this step are:

  • Define model architectures for category classification
  • Set up multi-label classification for tag prediction
  • Create duration estimation model
  • Configure model hyperparameters and evaluation metrics

Implementation Details

The implementation includes three main models:

  • Category Classifier:
    • Multi-class classification model
    • Uses TF-IDF vectorization
    • RandomForestClassifier with 200 estimators
  • Tag Predictor:
    • Multi-label classification model
    • OneVsRestClassifier with LinearSVC
    • Handles multiple tags per memento
  • Duration Estimator:
    • Multi-class classification model
    • Maps text to duration categories
    • Uses RandomForestClassifier

Inputs

The step requires the following inputs:

  • Processed data from Step 1 (CSV file)
  • Categories metadata (JSON file)
  • Tags metadata (JSON file)
  • Durations metadata (JSON file)
# Example input data structure
{
    "text_for_ml": "processed text content",
    "category": "category_id",
    "tags": ["tag1", "tag2"],
    "duration": "duration_value"
}

Outputs

The step produces the following outputs:

  • 1. Model Information (model_info.json)
  • 2. Model Metrics (model_metrics.json)

Complete Implementation

"""
Step 2: Model Architecture Definition

This module defines the ML model architectures used for classifying mementos.
These models will be trained in Step 3 using user mementos from Firebase.
"""

import os
import json
import pickle
import numpy as np
import pandas as pd
import logging
from typing import Dict, List, Tuple, Optional
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

class MementoModelTrainer:
    def __init__(self, 
                categories_path: str, 
                tags_path: str,
                durations_path: str,
                output_dir: str = "."):
        """Initialize the model trainer"""
        self.categories_path = categories_path
        self.tags_path = tags_path
        self.durations_path = durations_path
        self.output_dir = output_dir
        
        # Create output directory
        os.makedirs(output_dir, exist_ok=True)
        
        # Load metadata
        self.categories, self.category_ids = self._load_categories()
        self.tags, self.tag_ids = self._load_tags()
        self.durations, self.duration_ids = self._load_durations()
        
        # Initialize models
        self.vectorizer = None
        self.category_model = None
        self.tags_model = None
        self.duration_model = None

    def train_models(self, 
                     data_path: str, 
                     test_size: float = 0.2, 
                     random_state: int = 42,
                     use_grid_search: bool = False) -> Dict:
        """Train category, tag, and duration models"""
        # Load and prepare data
        X, y_category, y_tags, y_duration = self.load_data(data_path)
        
        # Create text vectorizer
        self.vectorizer = TfidfVectorizer(
            max_features=5000,
            min_df=2,
            max_df=0.8,
            ngram_range=(1, 2),
            stop_words='english'
        )
        
        # Train models
        self._train_category_model(X, y_category)
        self._train_tags_model(X, y_tags)
        self._train_duration_model(X, y_duration)
        
        # Evaluate and save models
        metrics = self._evaluate_models(X, y_category, y_tags, y_duration)
        self._save_models()
        self.save_model_info(metrics)
        
        return metrics

Step 3: Model Training

Overview

This step trains the machine learning models defined in Step 2 using the processed user mementos from Step 1. It includes comprehensive model training, evaluation, and validation processes.

Purpose

The main objectives of this step are:

  • Train category classification model
  • Train multi-label tag prediction model
  • Train duration estimation model
  • Evaluate model performance using various metrics
  • Save trained models and evaluation results

Implementation Details

The implementation includes several key components:

  • Hyperparameter Tuning:
    • Grid search for optimal parameters
    • Cross-validation for model validation
    • Handling of small datasets and multi-label cases
  • Model Training:
    • Category model training with class balancing
    • Tag model training with multi-label support
    • Duration model training with custom metrics
  • Model Evaluation:
    • Accuracy and F1 score calculation
    • Classification reports generation
    • Confusion matrix analysis

Inputs

The step requires the following inputs:

  • Processed data from Step 1
  • Model architectures from Step 2
  • Hyperparameter configurations
# Example hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'max_iter': [1000],
    'class_weight': ['balanced'],
    'solver': ['liblinear', 'saga']
}
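A sketch of how this grid could drive tuning via GridSearchCV; X and y stand for the TF-IDF features and category labels from the earlier steps, and the scoring and fold choices are assumptions.

# Sketch of hyperparameter tuning with the grid above; X and y are assumed
# TF-IDF features and category labels from earlier steps.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10],
    'max_iter': [1000],
    'class_weight': ['balanced'],
    'solver': ['liblinear', 'saga'],
}

search = GridSearchCV(
    LogisticRegression(),
    param_grid,
    scoring='f1_macro',   # macro-F1 is robust to class imbalance
    cv=3,                 # small fold count suits a small dataset
)
# search.fit(X, y)
# best_model = search.best_estimator_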

Outputs

The step produces the following outputs:

  • Trained model files (pickle format)
  • Model evaluation metrics
  • Training logs
Model Training Report
{
    "training_date": "2024-03-15",
    "model_versions": {
        "category_classifier": "v1.2.0",
        "tag_predictor": "v1.1.0",
        "duration_estimator": "v1.0.1"
    },
    "training_metrics": {
        "category_classification": {
            "accuracy": 0.8132,
            "f1_score": 0.7945,
            "precision": 0.8023,
            "recall": 0.7869,
            "training_samples": 1250,
            "validation_samples": 312
        },
        "tag_prediction": {
            "micro_f1": 0.8895,
            "macro_f1": 0.8567,
            "hamming_loss": 0.1105,
            "training_samples": 1250,
            "validation_samples": 312
        },
        "duration_estimation": {
            "accuracy": 0.7234,
            "mean_absolute_error": 0.4567,
            "training_samples": 1250,
            "validation_samples": 312
        }
    },
    "hyperparameters": {
        "category_classifier": {
            "model_type": "RandomForestClassifier",
            "n_estimators": 200,
            "max_depth": 10,
            "min_samples_split": 5
        },
        "tag_predictor": {
            "model_type": "MultiLabelClassifier",
            "threshold": 0.3,
            "base_estimator": "LinearSVC"
        },
        "duration_estimator": {
            "model_type": "DecisionTreeClassifier",
            "max_depth": 5,
            "min_samples_leaf": 10
        }
    },
    "training_time": {
        "data_preparation": "45.2s",
        "model_training": "189.7s",
        "validation": "23.4s",
        "total": "258.3s"
    },
    "validation_notes": [
        "Category classification shows strong performance on major categories",
        "Tag prediction achieves high micro-F1 score indicating good overall accuracy",
        "Duration estimation needs improvement, particularly for longer durations",
        "Model size optimized for production deployment"
    ]
}
Trained Models
  • Vectorizer: 01_vectorizer.pkl
  • Category Models:
    • 01_category_model.pkl
    • 02_category_model.pkl
  • Tags Model: 02_tags_model.pkl
  • Duration Model: 02_duration_model.pkl

Complete Implementation

"""
Step 3: Model Training

This module trains the ML models defined in Step 2 using user mementos from Step 1.
The trained models will be used to classify scraped data in Step 5.
"""

import os
import json
import logging
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple, Optional
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.preprocessing import MultiLabelBinarizer
import pickle
from datetime import datetime

class MementoModelTrainer:
    def __init__(self, 
                 input_dir: str = "ml_pipeline/output/step1_data_processing/processed_data",
                 output_dir: str = "ml_pipeline/output/step3_model_training"):
        """Initialize the trainer"""
        self.input_dir = input_dir
        self.output_dir = output_dir
        self.models_dir = os.path.join(output_dir, "models")
        self.metrics_dir = os.path.join(output_dir, "metrics")
        self.reports_dir = os.path.join(output_dir, "reports")
        
        # Create output directories
        os.makedirs(self.models_dir, exist_ok=True)
        os.makedirs(self.metrics_dir, exist_ok=True)
        os.makedirs(self.reports_dir, exist_ok=True)
        
        # Define hyperparameter grids
        self.param_grid = {
            'C': [0.1, 1, 10],
            'max_iter': [1000],
            'class_weight': ['balanced'],
            'solver': ['liblinear', 'saga']
        }

    def train_models(self):
        """Train all models"""
        # Load and prepare data
        df = self.load_training_data()
        X, y_dict = self.prepare_training_data(df)
        
        # Train models
        category_metrics = self.train_category_model(X, y_dict['category'])
        tags_metrics = self.train_tags_model(X, y_dict['tags'])
        duration_metrics = self.train_duration_model(X, y_dict['duration'])
        
        # Save vectorizer
        self.save_vectorizer()
        
        # Generate and save reports
        metrics = {
            'category': category_metrics,
            'tags': tags_metrics,
            'duration': duration_metrics
        }
        self.generate_model_report(metrics)

Step 4: Data Scraping

Overview

This step scrapes data from Secret NYC to create a testing dataset for our ML models. It collects public content that will be processed and classified in subsequent steps.

Purpose

The main objectives of this step are:

  • Scrape articles from Secret NYC website
  • Extract relevant information from articles
  • Clean and structure the scraped data
  • Prepare data for ML classification

Implementation Details

The implementation includes several key components:

  • Web Scraping:
    • Robust session management with retry strategy
    • BeautifulSoup for HTML parsing
    • Error handling and logging
  • Data Extraction:
    • Title and description extraction
    • Location geocoding using OpenStreetMap
    • Duration extraction using regex patterns (see the sketch after this list)
    • Date parsing and formatting
  • Data Cleaning:
    • Text cleaning and formatting
    • Language detection and filtering
    • Media URL extraction and validation
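An illustrative version of the regex-based duration extraction mentioned above; the patterns, the default, and the duration buckets are assumptions rather than the scraper's exact rules.

# Illustrative regex-based duration extraction; patterns and buckets are
# assumptions, not the scraper's exact rules.
import re

def extract_duration(text: str) -> str:
    """Map phrases like '2 hours' or '30 minutes' to a duration bucket."""
    match = re.search(r'(\d+)\s*(hour|hr|minute|min)s?', text, re.IGNORECASE)
    if not match:
        return "1hr"  # assumed default bucket
    value, unit = int(match.group(1)), match.group(2).lower()
    minutes = value * 60 if unit.startswith('h') else value
    if minutes <= 15:
        return "15min"
    if minutes <= 60:
        return "1hr"
    return "2hr+"

print(extract_duration("The walking tour lasts about 2 hours."))  # -> 2hr+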

Inputs

The step requires the following inputs:

  • Secret NYC website URL
  • Output directory configuration
# Example configuration
base_url = "https://secretnyc.co/things-to-do/"
output_dir = "ml_pipeline/output/step4_scraped_data"

Outputs

The step produces the following outputs:

  • Raw scraped data in JSON format
  • Structured memento objects
  • Scraping logs
  • Output location: ml_pipeline/output/step4_scraped_data/raw_data/ (raw article JSON)

Complete Implementation

"""
Step 4: Scrape Data

This module scrapes data from Secret NYC to create a testing dataset.
"""

import requests
from bs4 import BeautifulSoup
import json
import os
import re
import time
from datetime import datetime
from langdetect import detect
from typing import Dict, List, Optional
import logging
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

class SecretNYCScraper:
    def __init__(self, 
                 base_url: str = "https://secretnyc.co/things-to-do/",
                 output_dir: str = "ml_pipeline/output/step4_scraped_data"):
        """Initialize scraper"""
        self.base_url = base_url
        self.output_dir = output_dir
        self.raw_data_dir = os.path.join(output_dir, "raw_data")
        os.makedirs(self.raw_data_dir, exist_ok=True)
        self.default_user = "Secret NYC"

    def create_session(self) -> requests.Session:
        """Create session with retry strategy"""
        session = requests.Session()
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        return session

    def scrape_article(self, article_url: str) -> Optional[Dict]:
        """Scrape a single article"""
        session = self.create_session()
        try:
            response = session.get(article_url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")
            
            # Extract article data
            title = soup.find("h1").get_text(strip=True) if soup.find("h1") else "Untitled"
            paragraphs = soup.select("section.article__body p")
            desc = "\n".join(p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True))
            
            # Process and clean data
            cleaned = self.clean_description(desc)
            location = self.extract_fallback_location(desc, title)
            duration = self.extract_duration(desc)
            
            # Create memento object
            memento = {
                "userId": self.default_user,
                "location": self.geocode_location(location) if location else {},
                "media": self.extract_media(soup),
                "name": title,
                "description": cleaned,
                "category": "Other",
                "timestamp": self.parse_date(self.extract_date(soup)),
                "tags": ["Other"],
                "link": article_url,
                "mementoType": "public",
                "duration": duration
            }
            
            return memento
            
        except Exception as e:
            logging.error(f"Error scraping {article_url}: {e}")
            return None

Step 5: Process Scraped Data

Overview

This step processes the scraped data from Step 4 through the trained models from Step 3. It classifies each scraped memento into categories, tags, and durations, applying quality control measures to ensure accurate predictions.

Purpose

The main objectives of this step are:

  • Process scraped data through trained ML models
  • Classify mementos into categories, tags, and durations
  • Validate prediction confidence and quality
  • Generate processing reports and statistics

Implementation Details

The implementation includes several key components:

  • Model Integration:
    • Loading trained models from Step 3
    • Text vectorization and preprocessing
    • Multi-model prediction pipeline
  • Quality Control:
    • Confidence threshold validation
    • Prediction quality checks
    • Filtering low-confidence predictions
  • Data Processing:
    • Text feature extraction
    • Multi-label tag prediction
    • Duration estimation

Inputs

The step requires the following inputs:

  • Scraped data from Step 4 (JSON format)
  • Trained models from Step 3 (pickle files)
  • Confidence thresholds configuration
# Example confidence thresholds
confidence_thresholds = {
    "category": 0.4,
    "tags": 0.3,
    "duration": 0.4
}
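A minimal sketch of the confidence check implied by these thresholds; the *_confidence field names mirror the processed-memento fields shown later in this step.

# Sketch of the confidence check applied to each prediction; field names
# mirror the thresholds above.
def passes_thresholds(preds: dict, thresholds: dict) -> bool:
    """Return True only if every prediction meets its confidence threshold."""
    return all(
        preds.get(f"{field}_confidence", 0.0) >= min_conf
        for field, min_conf in thresholds.items()
    )

confidence_thresholds = {"category": 0.4, "tags": 0.3, "duration": 0.4}
preds = {"category_confidence": 0.55, "tags_confidence": 0.28, "duration_confidence": 0.61}
print(passes_thresholds(preds, confidence_thresholds))  # False: tags below 0.3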

Outputs

The step produces the following outputs:

  • Processed data with predictions (CSV and JSON)
  • Processing report (JSON)
  • Validation logs and outputs: ml_pipeline/output/step5_processed_data/validation/

Complete Implementation

"""
Step 5: Process Scraped Data

This module processes scraped data from Step 4 through trained models from Step 3.
It classifies each scraped memento into categories, tags, and durations.
"""

import json
import os
import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple
import logging
import pickle
from sklearn.preprocessing import MultiLabelBinarizer
from datetime import datetime

class ScrapedDataProcessor:
    def __init__(self, 
                 model_dir: str = "ml_pipeline/output/step3_model_training/models",
                 output_dir: str = "ml_pipeline/output/step5_processed_data"):
        """Initialize the processor"""
        self.model_dir = model_dir
        self.output_dir = output_dir
        self.processed_data_dir = os.path.join(output_dir, "processed_data")
        self.reports_dir = os.path.join(output_dir, "reports")
        self.validation_dir = os.path.join(output_dir, "validation")
        
        # Confidence thresholds
        self.confidence_thresholds = {
            "category": 0.4,
            "tags": 0.3,
            "duration": 0.4
        }
        
        # Create output directories
        os.makedirs(self.processed_data_dir, exist_ok=True)
        os.makedirs(self.reports_dir, exist_ok=True)
        os.makedirs(self.validation_dir, exist_ok=True)
        
        # Load models
        self._load_models()

    def process_memento(self, memento: Dict) -> Dict:
        """Process a single memento"""
        # Extract text for ML
        text = self._extract_text_for_ml(memento)
        
        # Vectorize text
        text_vec = self.vectorizer.transform([text])
        
        # Get predictions
        category_pred, category_conf = self._predict_category(text_vec)
        tags_pred, tags_conf = self._predict_tags(text_vec)
        duration_pred, duration_conf = self._predict_duration(text_vec)
        
        # Validate predictions
        validation = self._validate_predictions(
            category_pred, category_conf,
            tags_pred, tags_conf,
            duration_pred, duration_conf
        )
        
        # Update memento with predictions
        memento.update({
            "category": category_pred,
            "category_confidence": category_conf,
            "tags": tags_pred,
            "tags_confidence": tags_conf,
            "duration": duration_pred,
            "duration_confidence": duration_conf,
            "validation": validation
        })
        
        return memento

Step 6: ML Model Predictor

Overview

This step provides a production-ready predictor class that uses trained ML models to classify new mementos with categories, tags, and durations. It's designed for seamless integration with the scraper and other components of the system.

Purpose

The main objectives of this step are:

  • Provide a production-ready ML prediction interface
  • Classify mementos with categories, tags, and durations
  • Support batch prediction capabilities
  • Enable seamless integration with other components

Implementation Details

The implementation includes several key components:

  • Model Management:
    • TF-IDF vectorizer loading
    • Category classifier integration
    • Multi-label tag predictor
    • Duration estimator
  • Prediction Features:
    • Configurable prediction thresholds
    • Batch prediction support
    • Error handling and logging
    • Scraper integration utilities
  • Production Features:
    • Model versioning
    • Performance optimization
    • Memory management
    • Robust error handling

Inputs

The step requires the following inputs:

  • Trained models from Step 3 (pickle files)
  • Category definitions (JSON)
  • Tag definitions (JSON)
  • Duration definitions (JSON)
# Example model paths
models_dir = "ml_pipeline/models"
categories_path = "memento_categories_combined.json"
tags_path = "memento_tags_combined.json"
durations_path = "memento_durations.json"

Outputs

The step produces the following outputs:

  • Category predictions with confidence scores
  • Multi-label tag predictions
  • Duration estimates
  • Complete memento classifications
# Example prediction output
{
    "category": "🌳 Outdoors",
    "tags": ["🌅 Sunset", "🎵 Music", "🏞️ Parks"],
    "duration": "1-2 hours",
    "confidence_scores": {
        "category": 0.85,
        "tags": [0.75, 0.65, 0.60],
        "duration": 0.80
    }
}
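A hypothetical usage sketch of the predictor defined in the implementation below; the description text is a placeholder.

# Hypothetical usage of MementoPredictor; the description is a placeholder.
predictor = MementoPredictor(threshold=0.2)
result = predictor.classify_memento(
    "Golden-hour picnic with live jazz along the Hudson River waterfront"
)
print(result)   # {'category': ..., 'tags': [...], 'duration': ...}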

Complete Implementation

"""
Step 6: ML Model Predictor

This module provides a production-ready predictor class that uses trained ML models
to classify new mementos with categories, tags, and durations.
"""

import os
import json
import pickle
import logging
import numpy as np
from typing import List, Dict, Optional, Union

class MementoPredictor:
    def __init__(self, 
                 models_dir: Optional[str] = None,
                 categories_path: Optional[str] = None,
                 tags_path: Optional[str] = None,
                 durations_path: Optional[str] = None,
                 threshold: float = 0.2):
        """Initialize the predictor with trained models"""
        # Set up paths
        self.script_dir = os.path.dirname(os.path.abspath(__file__))
        self.root_dir = os.path.dirname(self.script_dir)
        
        # Set models_dir if not provided
        if models_dir is None:
            models_dir = os.path.join(self.script_dir, "models")
        self.models_dir = models_dir
        
        # Set paths for category, tag, and duration definitions
        self.categories_path = categories_path or os.path.join(self.root_dir, "memento_categories_combined.json")
        self.tags_path = tags_path or os.path.join(self.root_dir, "memento_tags_combined.json")
        self.durations_path = durations_path or os.path.join(self.root_dir, "memento_durations.json")
        
        # Set threshold
        self.threshold = threshold
        
        # Initialize models and data
        self.vectorizer = None
        self.category_model = None
        self.tags_model = None
        self.duration_model = None
        self.categories = None
        self.tags = None
        self.durations = None
        
        # Load models and data
        self._load_categories_and_tags()
        self._load_models()

    def classify_memento(self, 
                        description: str, 
                        context: Optional[Dict] = None) -> Dict[str, Union[str, List[str]]]:
        """Classify a memento with categories, tags, and duration"""
        # Get predictions
        category = self.predict_category(description)
        tags = self.predict_tags(description)
        duration = self.predict_duration(description)
        
        # Return complete classification
        return {
            "category": category,
            "tags": tags,
            "duration": duration
        }

Step 7: ML Pipeline Integration

Overview

This step provides tools to integrate the trained ML models with production systems, particularly focusing on updating existing scrapers to use ML-based classification. It includes robust safety features and comprehensive integration tools.

Purpose

The main objectives of this step are:

  • Integrate ML models with production systems
  • Update existing scrapers to use ML classification
  • Ensure safe and reliable integration
  • Provide monitoring and maintenance tools

Implementation Details

The implementation includes several key components:

  • Scraper Integration:
    • Automatic scraper detection
    • Code analysis and validation
    • Safe code modification
    • Backup creation
    • Rollback capabilities
  • Safety Features:
    • Dry run mode for testing
    • Automatic backups
    • Validation checks
    • Error recovery
    • Logging and monitoring
  • Integration Tools:
    • Command-line interface
    • Scraper analysis
    • Code modification utilities
    • Import management
    • Path resolution

Inputs

The step requires the following inputs:

  • Scraper file path or search directory
  • ML predictor from Step 6
  • Integration configuration
# Example command-line usage
python step7_integration.py --scraper path/to/scraper.py
python step7_integration.py --find --search-dir path/to/search
python step7_integration.py --scraper path/to/scraper.py --dryrun

Outputs

The step produces the following outputs:

  • Updated scraper files
  • Backup files
  • Integration logs
  • Analysis reports
# Example analysis output
{
    "path": "path/to/scraper.py",
    "has_assign_category": true,
    "has_assign_tags": true,
    "already_using_ml": false,
    "imports": ["import requests", "from bs4 import BeautifulSoup"],
    "size": 1024,
    "lines": 50
}

Complete Implementation

"""
Step 7: ML Pipeline Integration

This module provides tools to integrate the trained ML models with production systems,
particularly focusing on updating existing scrapers to use ML-based classification.
"""

import os
import sys
import argparse
import logging
import shutil
from datetime import datetime
from typing import Tuple

# Import the predictor
from step6_predictor import MementoPredictor

def find_scrapers(search_dir: str = None) -> list:
    """Find all potential scraper files in the given directory"""
    if search_dir is None:
        # Default to the parent of the directory containing this script
        script_dir = os.path.dirname(os.path.abspath(__file__))
        search_dir = os.path.dirname(script_dir)
    
    potential_scrapers = []
    
    # Search patterns that indicate a scraper file
    patterns = [
        "scrape_all_pages",
        "scrape_",
        "_scraper",
        "crawler",
    ]
    
    # Walk through the directory
    for root, _, files in os.walk(search_dir):
        for file in files:
            if file.endswith(".py"):
                file_path = os.path.join(root, file)
                
                # Check if filename matches any pattern
                if any(pattern in file.lower() for pattern in patterns):
                    potential_scrapers.append(file_path)
                else:
                    # Check file contents for key functions
                    try:
                        with open(file_path, "r", encoding="utf-8") as f:
                            content = f.read()
                            if "def assign_category" in content and "def assign_tags" in content:
                                potential_scrapers.append(file_path)
                    except Exception:
                        pass
    
    return potential_scrapers

def update_scraper(scraper_path: str, dryrun: bool = False) -> bool:
    """Update a scraper file to use ML-based classification"""
    # First analyze the scraper
    can_update, info = analyze_scraper(scraper_path)
    
    if not can_update:
        logging.error(f"Cannot update {scraper_path}: {info.get('error', 'Missing required functions')}")
        return False
    
    if info.get("already_using_ml", False):
        logging.warning(f"Scraper {scraper_path} is already using ML predictor")
        return False
    
    if dryrun:
        logging.info("DRY RUN: Would update scraper with ML predictor")
        return True
    
    # Create a backup
    backup_path = f"{scraper_path}.bak.{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    try:
        shutil.copy2(scraper_path, backup_path)
        logging.info(f"Created backup at {backup_path}")
    except Exception as e:
        logging.error(f"Failed to create backup: {e}")
        return False
    
    # Update the scraper
    try:
        predictor = MementoPredictor()
        success = predictor.update_scraper(scraper_path)
        
        if success:
            logging.info(f"Successfully updated {scraper_path}")
            return True
        else:
            logging.error(f"Failed to update {scraper_path}")
            return False
    except Exception as e:
        logging.error(f"Error updating scraper: {e}")
        
        # Try to restore from backup
        try:
            shutil.copy2(backup_path, scraper_path)
            logging.info(f"Restored from backup {backup_path}")
        except Exception as restore_error:
            logging.error(f"Failed to restore from backup: {restore_error}")
        
        return False

Pipeline Results: Generated Memento Datasets

Overview

After running the complete MEMENTO ML pipeline, we have generated rich, structured datasets of public mementos scraped from Secret NYC. These datasets represent real urban experiences, now classified and ready to be visualized as interactive markers in the MEMENTO web-app. Each dataset below corresponds to a different theme or category, providing a foundation for exploration, discovery, and engagement within the platform.

Public Mementos Generated (by Dataset)

Each card below represents a dataset of public mementos, grouped by theme. These are the actual lists used to populate the public mementos map and features in the MEMENTO web-app.

  • 🎭 Culture
  • 🏃 Escapes
  • 🍽️ Food & Drink
  • 🎯 Things to Do
  • 📰 Top News
  • 🌿 Wellness & Nature

Current Results

Setbacks and Challenges

  1. Data Quality Issues
    • Limited Training Data
      • Small dataset of user mementos
      • Imbalanced distribution across categories
      • Insufficient examples for rare activities
    • Data Consistency
      • Inconsistent text formatting
      • Varying levels of detail in descriptions
      • Mixed language usage (English + local terms)
  2. Model Limitations
    • Text Processing
      • Loss of context in short descriptions
      • Difficulty with slang and informal language
      • Challenges with location-specific terminology
    • Feature Engineering
      • Limited use of temporal features
      • Underutilization of location data
      • Basic text preprocessing pipeline
  3. Integration Challenges
    • System Integration
      • Complex integration with existing scrapers
      • Performance overhead in production
      • Real-time prediction latency
    • Scalability Issues
      • Memory constraints with large models
      • Processing time for batch predictions
      • Resource limitations in production


Improvement Strategies

  1. Data Enhancement
    • Data Collection
      • Expand training dataset
      • Balance category distribution
      • Include more diverse examples
    • Data Quality
      • Implement better text preprocessing
      • Standardize input formats
      • Add data validation steps
  2. Model Improvements
    • Architecture
      • Implement transformer-based models
      • Add ensemble methods
      • Use advanced NLP techniques
    • Training
      • Implement cross-validation
      • Add hyperparameter tuning
      • Use transfer learning
  3. Feature Engineering
    • Text Features
      • Add semantic analysis
      • Implement context-aware processing
      • Use word embeddings
    • Additional Features
      • Incorporate temporal patterns
      • Add location-based features
      • Include user behavior data
  4. System Optimization
    • Performance
      • Implement model quantization
      • Add caching mechanisms
      • Optimize prediction pipeline
    • Scalability
      • Add batch processing
      • Implement distributed computing
      • Optimize resource usage

Conclusion

While our current results show promise (81.32% category accuracy and 88.95% tag F1-score), there's significant room for improvement. The main challenges stem from data limitations and basic model architecture. However, with the proposed improvements in data quality, model architecture, and system optimization, we expect to achieve significantly better performance in the coming months.

Application of Computation and Machine Learning Methods in MEMENTO