Role of Machine Learning and Computation in MEMENTO
by Vaibhav Jain

Exploring Urban Data with Machine Learning

by Jonathan E. Stiles in Spring 2025

Computation Design Practices 2024-2025

Graduate School of Architecture, Planning and Preservation, Columbia University

MEMENTO Machine Learning QR
MEMENTO Platform QR
MEMENTO Project Documentation QR

1. Overview

What is MEMENTO?

MEMENTO Cover

MEMENTO is a real-time platform that captures urban experiences across the city, aiming to keep people from drifting into oblivion (being unaware of what is happening around them) and instead to transform those moments into engaging odysseys (long, eventful, adventurous journeys).

Unlike conventional platforms, MEMENTO invites users to look beyond the map — to notice, witness, interact, engage, record, react, and reflect on urban experiences as physical mementos, rather than merely experiencing them through virtual screens.

It empowers users to discover, capture, and engage with the mementos around them during their commutes, turning everyday journeys into moments of exploration, creation, and interaction.

MEMENTO serves as an interaction, intersection, and interplay between the city, its people, and their experiences — a platform:
By the people and the city,
For the people and the city,
Of the people and the city.

Overview of the Computation Tools, Methods and Machine Learning Pipeline in MEMENTO

MEMENTO - the platform

The Machine Learning pipeline for public mementos in MEMENTO is designed to automate the classification and tagging of public content sourced from platforms such as Secret NYC, Reddit, NYC Bucket List, Newsbreak, and more. The objective is to replace rule-based keyword matching with a robust multi-label classification model that leverages existing user-generated mementos as training data.

The pipeline consists of three primary models — Category Classification, Tag Prediction, and Duration Estimation — each trained using supervised learning techniques. Logistic Regression with a One-vs-Rest strategy and Random Forest classifiers are employed to predict relevant categories and multiple tags, while Decision Trees handle ordinal duration estimation. Text data undergoes preprocessing, including tokenization, stop-word removal, and TF-IDF vectorization, to convert descriptions into feature vectors suitable for model training.
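To make this concrete, here is a minimal sketch of the multi-label tagging approach with TF-IDF features and a One-vs-Rest Logistic Regression in scikit-learn; the example texts, tags, and the 0.3 cutoff are illustrative placeholders rather than MEMENTO's actual training data.

# Minimal multi-label sketch: TF-IDF features + One-vs-Rest Logistic
# Regression. Texts, tags, and the 0.3 threshold are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "hidden rooftop garden with skyline views",
    "late night jazz set in a basement bar",
    "pop-up food market under the bridge",
]
tag_sets = [["unmapped", "ephemeral"], ["ephemeral"], ["unmapped"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tag_sets)              # one binary column per tag

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)          # TF-IDF feature vectors

# One-vs-Rest fits an independent binary classifier per tag
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

probs = clf.predict_proba(vectorizer.transform(["secret rooftop garden tour"]))
predicted = [t for t, p in zip(mlb.classes_, probs[0]) if p >= 0.3]
print(predicted)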

By implementing this ML pipeline, MEMENTO enhances the consistency and accuracy of content classification, enabling the platform to transform unstructured public content into structured, contextually enriched mementos that align with its existing taxonomy.

2. Introduction

MEMENTO is not just a platform — it's a computational ecosystem that leverages a wide spectrum of tools, methods, and data-driven processes to transform everyday urban experiences into interactive, real-time narratives. It integrates multiple computational layers, each playing a distinct role in capturing, analyzing, and visualizing the city's overlooked moments.

Computational Framework

MEMENTO's computational framework is divided into two core sections:

🌐 Front End

The user-facing web-app interface that brings urban experiences to life through:

  • 🖥️ Web-App Creation: User profiles, memento capture forms, and dynamic content rendering.
  • 🗺️ Geospatial Mapping: Mapping mementos in real time using Mapbox and Google Maps API.
  • 📊 Data Visualization: Creating interactive, data-rich maps using D3.js, highlighting patterns and clusters.
  • 🔍 Interactive Mapping: User-controlled filters, radius selectors, and category-based memento discovery.
  • 👤 Explorer Profile Creation: User profiles that evolve through collected mementos, creating personalized urban journeys.
  • 🔄 User Interaction & Engagement: Filters, recommendations, and curated lists driven by user behavior.

⚙️ Back End

The computational backbone that processes, stores, and structures data using:

  • 💾 Firebase Cloud Storage: Real-time data storage for media, text, and location data.
  • 🧠 Machine Learning Models: Predictive models that recommend mementos based on user behavior and sentiment analysis.
  • 📝 Data Structuring & Input Mapping: Categorizing mementos by type, tag, duration — transforming raw inputs into structured data.
  • ⚡ Memento Analysis & Assignment: Algorithms assign memento categories, tags, and durations based on user inputs and contextual data.
  • 📍 Google Maps API: Geolocation data is layered onto dynamic maps, visualizing where experiences occur and how they're clustered.
  • 🌐 Public Data Scraping: Integrating public datasets from online platforms to supplement user-generated mementos with real-time urban events.

Together, these front-end and back-end components create a cohesive computational ecosystem that transforms urban experiences into interactive, real-time narratives, making the city's overlooked moments discoverable and engaging.

3. Role of Computation in MEMENTO

Computational Workflow

MEMENTO's workflow is structured around the creation, visualization, exploration, and curation of urban experiences, transforming scattered moments into structured datasets. The process includes both user-generated and public mementos, creating a dynamic, interactive map of the city's fleeting encounters.

MEMENTO Core actions

A comprehensive visualization of the MEMENTO platform's computational workflow, illustrating the flow of memento datasets from generation to visualization to exploration, forming the core interaction model of the platform.

1. Creation of User Mementos

Users upload media, add text reflections, and tag locations, creating structured mementos categorized by type, tag, and duration.
  • Media, text, and geolocation inputs.
  • Categorized using predefined lists (categories, tags, duration).
  • Data stored in Firebase as structured memento entries (see the sketch below).
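A minimal sketch of what storing one such entry could look like with the Firebase Admin SDK; the collection name "mementos" and the field values are illustrative assumptions rather than the platform's exact schema.

# Hypothetical sketch of writing one structured memento entry to Firestore.
# Collection name "mementos" and field values are illustrative assumptions;
# assumes a valid credentials file.
import firebase_admin
from firebase_admin import credentials, firestore

cred = credentials.Certificate("firebase_credentials.json")
firebase_admin.initialize_app(cred)
db = firestore.client()

memento = {
    "name": "Pop-up mural on Canal St",
    "description": "Fresh street art that appeared overnight.",
    "location": {"lat": 40.7182, "lng": -74.0021},
    "category": "Art",                      # from the predefined category list
    "tags": ["ephemeral", "unmapped"],      # from the predefined tag list
    "duration": "15min",                    # from the predefined duration list
    "mementoType": "user",
    "timestamp": firestore.SERVER_TIMESTAMP,
}
db.collection("mementos").add(memento)      # auto-generated document ID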

2. Creation of Public Mementos

Public mementos are generated through data scraping and machine learning, integrating citywide events and activities as real-time mementos.
  • Data sourced from public platforms.
  • Machine learning processes data in real time.
  • Structured to align with MEMENTO format.

3. Mementos on MEMENTO

All mementos — user and public — are populated on a real-time interactive map, each tagged with location, media, timestamp, and description.
  • Data visualized on a dynamic map.
  • Displays geolocation, media, and descriptive tags.
  • Serves as a playground of real-time urban experiences.

4. Exploration of Mementos

Users explore mementos using filters and settings, discovering experiences by category, tag, duration, and proximity.
  • Filter by categories, tags, duration, and distance.
  • Explore mementos through curated lists and live feed.
  • Discover mementos based on user's current location.

5. Curation of Mementos

Recommendations are generated based on user profiles, interaction history, and data analysis, creating personalized memento lists.
  • Personalized memento curation.
  • Daily, trending, recommended, and nearby mementos.
  • Data-driven recommendations based on user behavior.

4. Machine Learning Methodology for Public Mementos Generation

The MEMENTO ML pipeline is designed as a modular, five-step system that transforms raw public content into well-categorized, ML-enhanced mementos. Each step in the pipeline serves a specific purpose and contributes to the overall goal of automated content classification and enrichment.

  1. 🔄 Data Processing: Transform raw user mementos into standardized datasets
  2. ⚙️ Data Preparation: Prepare data for model training
  3. 🧠 Model Training: Train specialized classification models
  4. 🌐 Data Scraping: Collect and process public content
  5. ✨ Processing & Classification: Apply models to classify content

🔄 Step 1: Data Processing

The initial step focuses on transforming raw user mementos into a standardized dataset suitable for machine learning. This step includes:

📥 1. Data Collection and Standardization

  • Reading raw memento data from JSON files
  • Standardizing field names and data formats
  • Handling missing values and data inconsistencies
  • Implementing data validation checks

🧬 2. Feature Extraction

  • Processing text fields (title, description, location)
  • Extracting temporal information
  • Standardizing date formats
  • Handling special characters and formatting

✅ 3. Quality Control

  • Validating data integrity
  • Checking for required fields
  • Ensuring consistent data types
  • Generating processing statistics

⚙️ Step 2: Data Preparation

This step prepares the processed data from Step 1 for model training, covering data splitting, feature engineering, and text preprocessing, as sketched in code after the list below.

🪓 1. Data Splitting

  • Creating training and validation sets
  • Maintaining balanced category distribution
  • Preserving data integrity
  • Implementing stratified sampling

🛠️ 2. Feature Engineering

  • Creating text feature matrices
  • Implementing TF-IDF vectorization
  • Handling categorical variables
  • Preparing label encoders

🧹 3. Data Preprocessing

  • Text cleaning and normalization
  • Feature scaling
  • Handling missing values
  • Preparing multi-label formats
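A compact sketch of these preparation steps on toy data; the column names follow Step 1's output, but the rows themselves are illustrative.

# Sketch of the preparation steps above on toy data; column names follow
# Step 1's output, the rows are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "text_for_ml": ["hidden rooftop garden", "late night jazz bar",
                    "pop up food market", "street mural tour",
                    "secret ramen counter", "community garden walk"],
    "category": ["urban-nature", "culture", "food",
                 "culture", "food", "urban-nature"],
    "tags": [["unmapped"], ["ephemeral"], ["ephemeral", "unmapped"],
             ["unmapped"], ["ephemeral"], ["unmapped"]],
})

# Stratified split preserves the category distribution in both sets
train_df, val_df = train_test_split(
    df, test_size=0.5, stratify=df["category"], random_state=42)

# TF-IDF feature matrices (fit on training text only)
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_df["text_for_ml"])
X_val = vectorizer.transform(val_df["text_for_ml"])

# Multi-label format: one binary indicator column per tag
mlb = MultiLabelBinarizer()
y_tags_train = mlb.fit_transform(train_df["tags"])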

🧠 Step 3: Model Training

The pipeline implements three specialized models for different classification tasks, sketched in code after the list:

🌲 1. Category Classification Model

  • Type: Random Forest Classifier
  • Features: Title, description, location
  • Output: Category predictions with confidence scores
  • Parameters: n_estimators=100, max_depth=10

🏷️ 2. Tags Prediction Model

  • Type: Multi-label Classification
  • Features: Title, description, location
  • Output: Multiple tag predictions
  • Parameters: threshold=0.3

⏳ 3. Duration Estimation Model

  • Type: Decision Tree Classifier
  • Features: Title, description, location
  • Output: Duration predictions
  • Parameters: max_depth=5
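A wiring sketch for the three classifiers with the parameters listed above; the feature matrix X and the y_* labels are assumed to come from the preparation step, and the One-vs-Rest Logistic Regression base estimator follows the pipeline overview.

# Wiring sketch for the three models with the parameters listed above;
# X and the y_* labels are assumed outputs of the preparation step.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

category_model = RandomForestClassifier(n_estimators=100, max_depth=10)
tags_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
duration_model = DecisionTreeClassifier(max_depth=5)

# category_model.fit(X, y_category)
# tags_model.fit(X, y_tags)
# duration_model.fit(X, y_duration)
#
# Tags are kept only when predicted probability clears the 0.3 threshold:
# tag_probs = tags_model.predict_proba(X_new)[0]
# tags = [t for t, p in zip(tag_names, tag_probs) if p >= 0.3]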

🌐 Step 4: Data Scraping

This step collects and processes public content:

🌐 1. Web Scraping

  • Target: Secret NYC website
  • Content types: Articles, events, locations
  • Extraction methods: BeautifulSoup4
  • Rate limiting and error handling

📝 2. Content Processing

  • Text extraction and cleaning
  • Date parsing and standardization
  • Location extraction
  • Duration estimation

🔍 3. Data Validation

  • Content quality checks
  • Required field validation
  • Format standardization
  • Error logging

✨ Step 5: Data Processing and Classification

The final step applies the trained models to classify scraped content:

🤖 1. Model Application

  • Loading trained models
  • Text preprocessing
  • Feature extraction
  • Prediction generation

🛡️ 2. Quality Control

  • Confidence threshold checks
  • Prediction validation
  • Error handling
  • Report generation

📦 3. Output Generation

  • Creating processed mementos
  • Adding predictions and confidence scores
  • Generating processing reports
  • Saving results in multiple formats

Detailed ML Pipeline Breakdown

Installation and Setup

Overview

This section outlines the installation and setup requirements for running the MEMENTO ML pipeline. Follow these steps to set up your development environment and install all necessary dependencies.

System Requirements

  • Python 3.8 or higher
  • pip (Python package installer)
  • Git (for version control)
  • At least 4GB RAM (8GB recommended)
  • Firebase account and credentials

Installation Steps

# 1. Clone the repository
git clone https://github.com/0209vaibhav/Machine-Learning_MEMENTO.git
cd Machine-Learning_MEMENTO

# 2. Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install required packages
pip install -r requirements.txt

# 4. Set up Firebase credentials
# Place your Firebase credentials JSON file in:
# ml_pipeline/input/step1_data_loader/firebase_credentials.json

Required Packages

# Core ML and Data Processing
scikit-learn>=1.0.2
pandas>=1.3.0
numpy>=1.21.0
nltk>=3.6.0

# Web Scraping and Data Collection
beautifulsoup4>=4.9.3
requests>=2.26.0
selenium>=4.0.0

# Firebase Integration
firebase-admin>=5.0.0

# Text Processing
spacy>=3.1.0
langdetect>=1.0.9

# Utilities
python-dotenv>=0.19.0
tqdm>=4.62.0
joblib>=1.0.0

Configuration Files

The pipeline requires the following configuration files to be set up:

  • Firebase Credentials:
    {
        "type": "service_account",
        "project_id": "your-project-id",
        "private_key_id": "your-key-id",
        "private_key": "your-private-key",
        "client_email": "your-client-email",
        "client_id": "your-client-id",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "your-cert-url"
    }
  • Environment Variables:
    # .env file
    FIREBASE_CREDENTIALS_PATH=ml_pipeline/input/step1_data_loader/firebase_credentials.json
    MODEL_OUTPUT_DIR=ml_pipeline/output
    SCRAPING_INTERVAL=3600  # in seconds
    CONFIDENCE_THRESHOLD=0.4
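A small sketch of how these values could be read at runtime with python-dotenv (already listed in the requirements); the variable names match the .env file above, while the fallback defaults are assumptions.

# Sketch of reading the .env configuration above with python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads variables from the .env file into the environment

credentials_path = os.getenv("FIREBASE_CREDENTIALS_PATH")
output_dir = os.getenv("MODEL_OUTPUT_DIR", "ml_pipeline/output")
scraping_interval = int(os.getenv("SCRAPING_INTERVAL", "3600"))
confidence_threshold = float(os.getenv("CONFIDENCE_THRESHOLD", "0.4"))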

Verification Steps

After installation, verify your setup by running:

# 1. Verify Python version
python --version  # Should be 3.8 or higher

# 2. Verify package installation
pip list  # Should show all required packages

# 3. Verify Firebase connection
python -c "from firebase_admin import credentials, initialize_app; initialize_app(credentials.Certificate('ml_pipeline/input/step1_data_loader/firebase_credentials.json'))"

# 4. Run test script
python ml_pipeline/tests/test_setup.py

Common Issues and Solutions

  • Firebase Connection Issues:
    • Verify credentials file path and format
    • Check internet connection
    • Ensure Firebase project is active
  • Package Installation Issues:
    • Update pip: python -m pip install --upgrade pip
    • Clear pip cache: pip cache purge
    • Install packages individually if needed
  • Memory Issues:
    • Reduce batch size in model training
    • Use smaller model architectures
    • Enable garbage collection

Step 1: Data Loading and Preparation

Overview

This step is responsible for loading user mementos from Firebase and preparing them for ML training. It serves as the foundation for our machine learning pipeline by transforming raw user data into a structured format suitable for model training.

Purpose

The main objectives of this step are:

  • Load user mementos from Firebase database
  • Clean and normalize text data
  • Extract relevant features for ML training
  • Prepare data in a format suitable for model training

Implementation Details

The implementation uses the following key components:

  • MementoDataLoader Class: Main class handling data loading and processing
  • Text Preprocessing: Using NLTK for text cleaning and normalization
  • Feature Extraction: Combining multiple text fields for better ML features
  • Data Validation: Handling missing values and data type conversions

Inputs

The step requires the following inputs:

  • Firebase credentials (JSON file)
    {
        "type": "service_account",
        "project_id": "your-project-id",
        "private_key_id": "your-key-id",
        "private_key": "your-private-key",
        "client_email": "your-client-email",
        "client_id": "your-client-id",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "your-cert-url"
    }
  • Categories metadata (JSON file)
    [
      {
        "id": "architecture",
        "name": "Architecture",
        "symbol": "🏛️",
        "keywords": ["building", "structure", "design", "architecture", "landmark", "historic", "monument", "architectural"]
      },
      {
        "id": "urban-nature",
        "name": "Urban Nature",
        "symbol": "🌿",
        "keywords": ["park", "garden", "nature", "green space", "plants", "trees", "urban nature", "outdoor"]
      },
      // ... more categories ...
    ]
  • Tags metadata (JSON file)
    [
      {
        "id": "ephemeral",
        "name": "Ephemeral",
        "symbol": "🌀",
        "keywords": ["temporary", "fleeting", "short-lived", "momentary", "brief", "passing", "transient", "ephemeral"]
      },
      {
        "id": "unmapped",
        "name": "Unmapped",
        "symbol": "📍",
        "keywords": ["hidden", "undiscovered", "secret", "unknown", "unexplored", "off-map", "unmapped", "new"]
      },
      // ... more tags ...
    ]
  • Durations metadata (JSON file)
    [
      {
        "id": "15min",
        "name": "15 Minutes",
        "symbol": "⚡",
        "keywords": ["15 minutes", "quarter hour", "quick stop", "brief moment", "passing", "fleeting"]
      },
      {
        "id": "1hr",
        "name": "1 Hour",
        "symbol": "⏱️",
        "keywords": ["1 hour", "one hour", "hour long", "60 minutes", "quick visit"]
      },
      // ... more durations ...
    ]

Outputs

The step produces the following outputs:

  • Processed training data (user_mementos_processed.csv) containing the cleaned mementos and a combined text_for_ml field

Complete Implementation

"""
Step 1: Data Loading and Preparation

This module is responsible for loading user mementos from Firebase for ML training.
These user mementos will be used as training data for the ML models.
"""

import json
import os
import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple
import logging
from sklearn.preprocessing import MultiLabelBinarizer
import firebase_admin
from firebase_admin import credentials, firestore
from datetime import datetime
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# NLTK corpora used below ('punkt', 'stopwords', 'wordnet') must be
# downloaded once, e.g. nltk.download('stopwords')

class MementoDataLoader:
    def __init__(self, 
                 categories_path: str = None,
                 tags_path: str = None,
                 durations_path: str = None,
                 firebase_credentials_path: str = None):
        """Initialize the data loader"""
        self.categories = {}
        self.tags = {}
        self.durations = {}
        self.db = None
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
        
        # Load metadata if paths provided
        if categories_path:
            self.load_categories(categories_path)
        if tags_path:
            self.load_tags(tags_path)
        if durations_path:
            self.load_durations(durations_path)
            
        # Initialize Firebase if credentials provided
        if firebase_credentials_path:
            self.firebase_credentials_path = firebase_credentials_path
            self._initialize_firebase()

    def _preprocess_text(self, text: str) -> str:
        """Preprocess text for ML"""
        if not isinstance(text, str):
            text = str(text)
        text = text.lower()
        text = re.sub(r'[^a-z\s]', ' ', text)
        text = ' '.join(text.split())
        return text

    def _extract_text_for_ml(self, memento: Dict) -> str:
        """Extract and preprocess text for ML from memento"""
        text_parts = []
        name = memento.get('name', '')
        # Repeat the title three times so it carries extra weight in TF-IDF
        text_parts.extend([name] * 3)
        description = memento.get('description', '')
        text_parts.append(description)
        location = memento.get('location', {})
        if isinstance(location, dict):
            # Include the raw location dict as text so place names contribute features
            location_str = str(location)
            text_parts.append(location_str)
        combined_text = ' '.join(filter(None, text_parts))
        return self._preprocess_text(combined_text)

    def prepare_training_data(self, df: pd.DataFrame, output_dir: str) -> Dict[str, str]:
        """Prepare data for training"""
        os.makedirs(output_dir, exist_ok=True)
        processed_data = []
        for _, memento in df.iterrows():
            memento_dict = memento.to_dict()
            text_for_ml = self._extract_text_for_ml(memento_dict)
            memento_dict['text_for_ml'] = text_for_ml
            processed_data.append(memento_dict)
        
        processed_df = pd.DataFrame(processed_data)
        output_path = os.path.join(output_dir, 'user_mementos_processed.csv')
        processed_df.to_csv(output_path, index=False)
        return {'processed_data': output_path}
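A minimal usage sketch against the excerpt above; the single-row DataFrame stands in for mementos fetched from Firebase, and NLTK's stopwords corpus must be downloaded first (nltk.download('stopwords')).

# Minimal usage sketch of MementoDataLoader with placeholder data.
import pandas as pd

loader = MementoDataLoader()  # metadata/Firebase paths omitted for brevity
df = pd.DataFrame([{
    "name": "Hidden rooftop garden",
    "description": "A quiet green space above the noise of Midtown.",
    "location": {"lat": 40.7549, "lng": -73.9840},
}])
outputs = loader.prepare_training_data(df, "ml_pipeline/output/step1_data_processing")
print(outputs)  # {'processed_data': '.../user_mementos_processed.csv'}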

Step 2: Model Architecture Definition

Overview

This step defines the machine learning model architectures used for classifying mementos. It sets up three specialized models for different aspects of memento classification: categories, tags, and duration estimation.

Purpose

The main objectives of this step are:

  • Define model architectures for category classification
  • Set up multi-label classification for tag prediction
  • Create duration estimation model
  • Configure model hyperparameters and evaluation metrics

Implementation Details

The implementation includes three main models:

  • Category Classifier:
    • Multi-class classification model
    • Uses TF-IDF vectorization
    • RandomForestClassifier with 200 estimators
  • Tag Predictor:
    • Multi-label classification model
    • OneVsRestClassifier with LinearSVC
    • Handles multiple tags per memento
  • Duration Estimator:
    • Multi-class classification model
    • Maps text to duration categories
    • Uses RandomForestClassifier

Inputs

The step requires the following inputs:

  • Processed data from Step 1 (CSV file)
  • Categories metadata (JSON file)
  • Tags metadata (JSON file)
  • Durations metadata (JSON file)
# Example input data structure
{
    "text_for_ml": "processed text content",
    "category": "category_id",
    "tags": ["tag1", "tag2"],
    "duration": "duration_value"
}

Outputs

The step produces the following outputs:

  • 1. Model Information (model_info.json)
  • 2. Model Metrics (model_metrics.json)

Complete Implementation

"""
Step 2: Model Architecture Definition

This module defines the ML model architectures used for classifying mementos.
These models will be trained in Step 3 using user mementos from Firebase.
"""

import os
import json
import pickle
import numpy as np
import pandas as pd
import logging
from typing import Dict, List, Tuple, Optional
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

class MementoModelTrainer:
    def __init__(self, 
                categories_path: str, 
                tags_path: str,
                durations_path: str,
                output_dir: str = "."):
        """Initialize the model trainer"""
        self.categories_path = categories_path
        self.tags_path = tags_path
        self.durations_path = durations_path
        self.output_dir = output_dir
        
        # Create output directory
        os.makedirs(output_dir, exist_ok=True)
        
        # Load metadata
        self.categories, self.category_ids = self._load_categories()
        self.tags, self.tag_ids = self._load_tags()
        self.durations, self.duration_ids = self._load_durations()
        
        # Initialize models
        self.vectorizer = None
        self.category_model = None
        self.tags_model = None
        self.duration_model = None

    def train_models(self, 
                     data_path: str, 
                     test_size: float = 0.2, 
                     random_state: int = 42,
                     use_grid_search: bool = False) -> Dict:
        """Train category, tag, and duration models"""
        # Load and prepare data
        X, y_category, y_tags, y_duration = self.load_data(data_path)
        
        # Create text vectorizer
        self.vectorizer = TfidfVectorizer(
            max_features=5000,
            min_df=2,
            max_df=0.8,
            ngram_range=(1, 2),
            stop_words='english'
        )
        
        # Train models
        self._train_category_model(X, y_category)
        self._train_tags_model(X, y_tags)
        self._train_duration_model(X, y_duration)
        
        # Evaluate and save models
        metrics = self._evaluate_models(X, y_category, y_tags, y_duration)
        self._save_models()
        self.save_model_info(metrics)
        
        return metrics

Step 3: Model Training

Overview

This step trains the machine learning models defined in Step 2 using the processed user mementos from Step 1. It includes comprehensive model training, evaluation, and validation processes.

Purpose

The main objectives of this step are:

  • Train category classification model
  • Train multi-label tag prediction model
  • Train duration estimation model
  • Evaluate model performance using various metrics
  • Save trained models and evaluation results

Implementation Details

The implementation includes several key components:

  • Hyperparameter Tuning:
    • Grid search for optimal parameters
    • Cross-validation for model validation
    • Handling of small datasets and multi-label cases
  • Model Training:
    • Category model training with class balancing
    • Tag model training with multi-label support
    • Duration model training with custom metrics
  • Model Evaluation:
    • Accuracy and F1 score calculation
    • Classification reports generation
    • Confusion matrix analysis

Inputs

The step requires the following inputs:

  • Processed data from Step 1
  • Model architectures from Step 2
  • Hyperparameter configurations
# Example hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'max_iter': [1000],
    'class_weight': ['balanced'],
    'solver': ['liblinear', 'saga']
}
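A sketch of how this grid could drive tuning via GridSearchCV; X and y stand for the TF-IDF features and category labels from the earlier steps, and the scoring and fold choices are assumptions.

# Sketch of hyperparameter tuning with the grid above; X and y are assumed
# TF-IDF features and category labels from earlier steps.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10],
    'max_iter': [1000],
    'class_weight': ['balanced'],
    'solver': ['liblinear', 'saga'],
}

search = GridSearchCV(
    LogisticRegression(),
    param_grid,
    scoring='f1_macro',   # macro-F1 is robust to class imbalance
    cv=3,                 # small fold count suits a small dataset
)
# search.fit(X, y)
# best_model = search.best_estimator_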

Outputs

The step produces the following outputs:

  • Trained model files (pickle format)
  • Model evaluation metrics
  • Training logs
Model Training Report
{
    "training_date": "2024-03-15",
    "model_versions": {
        "category_classifier": "v1.2.0",
        "tag_predictor": "v1.1.0",
        "duration_estimator": "v1.0.1"
    },
    "training_metrics": {
        "category_classification": {
            "accuracy": 0.8132,
            "f1_score": 0.7945,
            "precision": 0.8023,
            "recall": 0.7869,
            "training_samples": 1250,
            "validation_samples": 312
        },
        "tag_prediction": {
            "micro_f1": 0.8895,
            "macro_f1": 0.8567,
            "hamming_loss": 0.1105,
            "training_samples": 1250,
            "validation_samples": 312
        },
        "duration_estimation": {
            "accuracy": 0.7234,
            "mean_absolute_error": 0.4567,
            "training_samples": 1250,
            "validation_samples": 312
        }
    },
    "hyperparameters": {
        "category_classifier": {
            "model_type": "RandomForestClassifier",
            "n_estimators": 200,
            "max_depth": 10,
            "min_samples_split": 5
        },
        "tag_predictor": {
            "model_type": "MultiLabelClassifier",
            "threshold": 0.3,
            "base_estimator": "LinearSVC"
        },
        "duration_estimator": {
            "model_type": "DecisionTreeClassifier",
            "max_depth": 5,
            "min_samples_leaf": 10
        }
    },
    "training_time": {
        "data_preparation": "45.2s",
        "model_training": "189.7s",
        "validation": "23.4s",
        "total": "258.3s"
    },
    "validation_notes": [
        "Category classification shows strong performance on major categories",
        "Tag prediction achieves high micro-F1 score indicating good overall accuracy",
        "Duration estimation needs improvement, particularly for longer durations",
        "Model size optimized for production deployment"
    ]
}
Trained Models
  • Vectorizer: 01_vectorizer.pkl
  • Category Models:
    • 01_category_model.pkl
    • 02_category_model.pkl
  • Tags Model: 02_tags_model.pkl
  • Duration Model: 02_duration_model.pkl

Complete Implementation

"""
Step 3: Model Training

This module trains the ML models defined in Step 2 using user mementos from Step 1.
The trained models will be used to classify scraped data in Step 5.
"""

import os
import json
import logging
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple, Optional
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.preprocessing import MultiLabelBinarizer
import pickle
from datetime import datetime

class MementoModelTrainer:
    def __init__(self, 
                 input_dir: str = "ml_pipeline/output/step1_data_processing/processed_data",
                 output_dir: str = "ml_pipeline/output/step3_model_training"):
        """Initialize the trainer"""
        self.input_dir = input_dir
        self.output_dir = output_dir
        self.models_dir = os.path.join(output_dir, "models")
        self.metrics_dir = os.path.join(output_dir, "metrics")
        self.reports_dir = os.path.join(output_dir, "reports")
        
        # Create output directories
        os.makedirs(self.models_dir, exist_ok=True)
        os.makedirs(self.metrics_dir, exist_ok=True)
        os.makedirs(self.reports_dir, exist_ok=True)
        
        # Define hyperparameter grids
        self.param_grid = {
            'C': [0.1, 1, 10],
            'max_iter': [1000],
            'class_weight': ['balanced'],
            'solver': ['liblinear', 'saga']
        }

    def train_models(self):
        """Train all models"""
        # Load and prepare data
        df = self.load_training_data()
        X, y_dict = self.prepare_training_data(df)
        
        # Train models
        category_metrics = self.train_category_model(X, y_dict['category'])
        tags_metrics = self.train_tags_model(X, y_dict['tags'])
        duration_metrics = self.train_duration_model(X, y_dict['duration'])
        
        # Save vectorizer
        self.save_vectorizer()
        
        # Generate and save reports
        metrics = {
            'category': category_metrics,
            'tags': tags_metrics,
            'duration': duration_metrics
        }
        self.generate_model_report(metrics)

Step 4: Data Scraping

Overview

This step scrapes data from Secret NYC to create a testing dataset for our ML models. It collects public content that will be processed and classified in subsequent steps.

Purpose

The main objectives of this step are:

  • Scrape articles from Secret NYC website
  • Extract relevant information from articles
  • Clean and structure the scraped data
  • Prepare data for ML classification

Implementation Details

The implementation includes several key components:

  • Web Scraping:
    • Robust session management with retry strategy
    • BeautifulSoup for HTML parsing
    • Error handling and logging
  • Data Extraction:
    • Title and description extraction
    • Location geocoding using OpenStreetMap
    • Duration extraction using regex patterns (see the sketch after this list)
    • Date parsing and formatting
  • Data Cleaning:
    • Text cleaning and formatting
    • Language detection and filtering
    • Media URL extraction and validation
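An illustrative version of the regex-based duration extraction mentioned above; the patterns, the default, and the duration buckets are assumptions rather than the scraper's exact rules.

# Illustrative regex-based duration extraction; patterns and buckets are
# assumptions, not the scraper's exact rules.
import re

def extract_duration(text: str) -> str:
    """Map phrases like '2 hours' or '30 minutes' to a duration bucket."""
    match = re.search(r'(\d+)\s*(hour|hr|minute|min)s?', text, re.IGNORECASE)
    if not match:
        return "1hr"  # assumed default bucket
    value, unit = int(match.group(1)), match.group(2).lower()
    minutes = value * 60 if unit.startswith('h') else value
    if minutes <= 15:
        return "15min"
    if minutes <= 60:
        return "1hr"
    return "2hr+"

print(extract_duration("The walking tour lasts about 2 hours."))  # -> 2hr+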

Inputs

The step requires the following inputs:

  • Secret NYC website URL
  • Output directory configuration
# Example configuration
base_url = "https://secretnyc.co/things-to-do/"
output_dir = "ml_pipeline/output/step4_scraped_data"

Outputs

The step produces the following outputs:

  • Raw scraped data in JSON format
  • Structured memento objects
  • Scraping logs
  • Output location: ml_pipeline/output/step4_scraped_data/raw_data/ (raw article JSON)

Complete Implementation

"""
Step 4: Scrape Data

This module scrapes data from Secret NYC to create a testing dataset.
"""

import requests
from bs4 import BeautifulSoup
import json
import os
import re
import time
from datetime import datetime
from langdetect import detect
from typing import Dict, List, Optional
import logging
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

class SecretNYCScraper:
    def __init__(self, 
                 base_url: str = "https://secretnyc.co/things-to-do/",
                 output_dir: str = "ml_pipeline/output/step4_scraped_data"):
        """Initialize scraper"""
        self.base_url = base_url
        self.output_dir = output_dir
        self.raw_data_dir = os.path.join(output_dir, "raw_data")
        os.makedirs(self.raw_data_dir, exist_ok=True)
        self.default_user = "Secret NYC"

    def create_session(self) -> requests.Session:
        """Create session with retry strategy"""
        session = requests.Session()
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        return session

    def scrape_article(self, article_url: str) -> Optional[Dict]:
        """Scrape a single article"""
        session = self.create_session()
        try:
            response = session.get(article_url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")
            
            # Extract article data
            title = soup.find("h1").get_text(strip=True) if soup.find("h1") else "Untitled"
            paragraphs = soup.select("section.article__body p")
            desc = "\n".join(p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True))
            
            # Process and clean data
            cleaned = self.clean_description(desc)
            location = self.extract_fallback_location(desc, title)
            duration = self.extract_duration(desc)
            
            # Create memento object
            memento = {
                "userId": self.default_user,
                "location": self.geocode_location(location) if location else {},
                "media": self.extract_media(soup),
                "name": title,
                "description": cleaned,
                "category": "Other",
                "timestamp": self.parse_date(self.extract_date(soup)),
                "tags": ["Other"],
                "link": article_url,
                "mementoType": "public",
                "duration": duration
            }
            
            return memento
            
        except Exception as e:
            logging.error(f"Error scraping {article_url}: {e}")
            return None

Step 5: Process Scraped Data

Overview

This step processes the scraped data from Step 4 through the trained models from Step 3. It classifies each scraped memento into categories, tags, and durations, applying quality control measures to ensure accurate predictions.

Purpose

The main objectives of this step are:

  • Process scraped data through trained ML models
  • Classify mementos into categories, tags, and durations
  • Validate prediction confidence and quality
  • Generate processing reports and statistics

Implementation Details

The implementation includes several key components:

  • Model Integration:
    • Loading trained models from Step 3
    • Text vectorization and preprocessing
    • Multi-model prediction pipeline
  • Quality Control:
    • Confidence threshold validation
    • Prediction quality checks
    • Filtering low-confidence predictions
  • Data Processing:
    • Text feature extraction
    • Multi-label tag prediction
    • Duration estimation

Inputs

The step requires the following inputs:

  • Scraped data from Step 4 (JSON format)
  • Trained models from Step 3 (pickle files)
  • Confidence thresholds configuration
# Example confidence thresholds
confidence_thresholds = {
    "category": 0.4,
    "tags": 0.3,
    "duration": 0.4
}
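A minimal sketch of the confidence check implied by these thresholds; the *_confidence field names mirror the processed-memento fields shown later in this step.

# Sketch of the confidence check applied to each prediction; field names
# mirror the thresholds above.
def passes_thresholds(preds: dict, thresholds: dict) -> bool:
    """Return True only if every prediction meets its confidence threshold."""
    return all(
        preds.get(f"{field}_confidence", 0.0) >= min_conf
        for field, min_conf in thresholds.items()
    )

confidence_thresholds = {"category": 0.4, "tags": 0.3, "duration": 0.4}
preds = {"category_confidence": 0.55, "tags_confidence": 0.28, "duration_confidence": 0.61}
print(passes_thresholds(preds, confidence_thresholds))  # False: tags below 0.3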

Outputs

The step produces the following outputs:

  • Processed data with predictions (CSV and JSON)
  • Processing report (JSON)
  • Validation logs and outputs: ml_pipeline/output/step5_processed_data/validation/

Complete Implementation

"""
Step 5: Process Scraped Data

This module processes scraped data from Step 4 through trained models from Step 3.
It classifies each scraped memento into categories, tags, and durations.
"""

import json
import os
import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple
import logging
import pickle
from sklearn.preprocessing import MultiLabelBinarizer
from datetime import datetime

class ScrapedDataProcessor:
    def __init__(self, 
                 model_dir: str = "ml_pipeline/output/step3_model_training/models",
                 output_dir: str = "ml_pipeline/output/step5_processed_data"):
        """Initialize the processor"""
        self.model_dir = model_dir
        self.output_dir = output_dir
        self.processed_data_dir = os.path.join(output_dir, "processed_data")
        self.reports_dir = os.path.join(output_dir, "reports")
        self.validation_dir = os.path.join(output_dir, "validation")
        
        # Confidence thresholds
        self.confidence_thresholds = {
            "category": 0.4,
            "tags": 0.3,
            "duration": 0.4
        }
        
        # Create output directories
        os.makedirs(self.processed_data_dir, exist_ok=True)
        os.makedirs(self.reports_dir, exist_ok=True)
        os.makedirs(self.validation_dir, exist_ok=True)
        
        # Load models
        self._load_models()

    def process_memento(self, memento: Dict) -> Dict:
        """Process a single memento"""
        # Extract text for ML
        text = self._extract_text_for_ml(memento)
        
        # Vectorize text
        text_vec = self.vectorizer.transform([text])
        
        # Get predictions
        category_pred, category_conf = self._predict_category(text_vec)
        tags_pred, tags_conf = self._predict_tags(text_vec)
        duration_pred, duration_conf = self._predict_duration(text_vec)
        
        # Validate predictions
        validation = self._validate_predictions(
            category_pred, category_conf,
            tags_pred, tags_conf,
            duration_pred, duration_conf
        )
        
        # Update memento with predictions
        memento.update({
            "category": category_pred,
            "category_confidence": category_conf,
            "tags": tags_pred,
            "tags_confidence": tags_conf,
            "duration": duration_pred,
            "duration_confidence": duration_conf,
            "validation": validation
        })
        
        return memento

Step 6: ML Model Predictor

Overview

This step provides a production-ready predictor class that uses trained ML models to classify new mementos with categories, tags, and durations. It's designed for seamless integration with the scraper and other components of the system.

Purpose

The main objectives of this step are:

  • Provide a production-ready ML prediction interface
  • Classify mementos with categories, tags, and durations
  • Support batch prediction capabilities
  • Enable seamless integration with other components

Implementation Details

The implementation includes several key components:

  • Model Management:
    • TF-IDF vectorizer loading
    • Category classifier integration
    • Multi-label tag predictor
    • Duration estimator
  • Prediction Features:
    • Configurable prediction thresholds
    • Batch prediction support
    • Error handling and logging
    • Scraper integration utilities
  • Production Features:
    • Model versioning
    • Performance optimization
    • Memory management
    • Robust error handling

Inputs

The step requires the following inputs:

  • Trained models from Step 3 (pickle files)
  • Category definitions (JSON)
  • Tag definitions (JSON)
  • Duration definitions (JSON)
# Example model paths
models_dir = "ml_pipeline/models"
categories_path = "memento_categories_combined.json"
tags_path = "memento_tags_combined.json"
durations_path = "memento_durations.json"

Outputs

The step produces the following outputs:

  • Category predictions with confidence scores
  • Multi-label tag predictions
  • Duration estimates
  • Complete memento classifications
# Example prediction output
{
    "category": "🌳 Outdoors",
    "tags": ["🌅 Sunset", "🎵 Music", "🏞️ Parks"],
    "duration": "1-2 hours",
    "confidence_scores": {
        "category": 0.85,
        "tags": [0.75, 0.65, 0.60],
        "duration": 0.80
    }
}
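A hypothetical usage sketch of the predictor defined in the implementation below; the description text is a placeholder.

# Hypothetical usage of MementoPredictor; the description is a placeholder.
predictor = MementoPredictor(threshold=0.2)
result = predictor.classify_memento(
    "Golden-hour picnic with live jazz along the Hudson River waterfront"
)
print(result)   # {'category': ..., 'tags': [...], 'duration': ...}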

Complete Implementation

"""
Step 6: ML Model Predictor

This module provides a production-ready predictor class that uses trained ML models
to classify new mementos with categories, tags, and durations.
"""

import os
import json
import pickle
import logging
import numpy as np
from typing import List, Dict, Optional, Union

class MementoPredictor:
    def __init__(self, 
                 models_dir: Optional[str] = None,
                 categories_path: Optional[str] = None,
                 tags_path: Optional[str] = None,
                 durations_path: Optional[str] = None,
                 threshold: float = 0.2):
        """Initialize the predictor with trained models"""
        # Set up paths
        self.script_dir = os.path.dirname(os.path.abspath(__file__))
        self.root_dir = os.path.dirname(self.script_dir)
        
        # Set models_dir if not provided
        if models_dir is None:
            models_dir = os.path.join(self.script_dir, "models")
        self.models_dir = models_dir
        
        # Set paths for category, tag, and duration definitions
        self.categories_path = categories_path or os.path.join(self.root_dir, "memento_categories_combined.json")
        self.tags_path = tags_path or os.path.join(self.root_dir, "memento_tags_combined.json")
        self.durations_path = durations_path or os.path.join(self.root_dir, "memento_durations.json")
        
        # Set threshold
        self.threshold = threshold
        
        # Initialize models and data
        self.vectorizer = None
        self.category_model = None
        self.tags_model = None
        self.duration_model = None
        self.categories = None
        self.tags = None
        self.durations = None
        
        # Load models and data
        self._load_categories_and_tags()
        self._load_models()

    def classify_memento(self, 
                        description: str, 
                        context: Optional[Dict] = None) -> Dict[str, Union[str, List[str]]]:
        """Classify a memento with categories, tags, and duration"""
        # Get predictions
        category = self.predict_category(description)
        tags = self.predict_tags(description)
        duration = self.predict_duration(description)
        
        # Return complete classification
        return {
            "category": category,
            "tags": tags,
            "duration": duration
        }

Step 7: ML Pipeline Integration

Overview

This step provides tools to integrate the trained ML models with production systems, particularly focusing on updating existing scrapers to use ML-based classification. It includes robust safety features and comprehensive integration tools.

Purpose

The main objectives of this step are:

  • Integrate ML models with production systems
  • Update existing scrapers to use ML classification
  • Ensure safe and reliable integration
  • Provide monitoring and maintenance tools

Implementation Details

The implementation includes several key components:

  • Scraper Integration:
    • Automatic scraper detection
    • Code analysis and validation
    • Safe code modification
    • Backup creation
    • Rollback capabilities
  • Safety Features:
    • Dry run mode for testing
    • Automatic backups
    • Validation checks
    • Error recovery
    • Logging and monitoring
  • Integration Tools:
    • Command-line interface
    • Scraper analysis
    • Code modification utilities
    • Import management
    • Path resolution

Inputs

The step requires the following inputs:

  • Scraper file path or search directory
  • ML predictor from Step 6
  • Integration configuration
# Example command-line usage
python step7_integration.py --scraper path/to/scraper.py
python step7_integration.py --find --search-dir path/to/search
python step7_integration.py --scraper path/to/scraper.py --dryrun

Outputs

The step produces the following outputs:

  • Updated scraper files
  • Backup files
  • Integration logs
  • Analysis reports
# Example analysis output
{
    "path": "path/to/scraper.py",
    "has_assign_category": true,
    "has_assign_tags": true,
    "already_using_ml": false,
    "imports": ["import requests", "from bs4 import BeautifulSoup"],
    "size": 1024,
    "lines": 50
}

Complete Implementation

"""
Step 7: ML Pipeline Integration

This module provides tools to integrate the trained ML models with production systems,
particularly focusing on updating existing scrapers to use ML-based classification.
"""

import os
import sys
import argparse
import logging
import shutil
from datetime import datetime
from typing import Tuple

# Import the predictor
from step6_predictor import MementoPredictor

def find_scrapers(search_dir: str = None) -> list:
    """Find all potential scraper files in the given directory"""
    if search_dir is None:
        # Default to the parent of the directory containing this script
        script_dir = os.path.dirname(os.path.abspath(__file__))
        search_dir = os.path.dirname(script_dir)
    
    potential_scrapers = []
    
    # Search patterns that indicate a scraper file
    patterns = [
        "scrape_all_pages",
        "scrape_",
        "_scraper",
        "crawler",
    ]
    
    # Walk through the directory
    for root, _, files in os.walk(search_dir):
        for file in files:
            if file.endswith(".py"):
                file_path = os.path.join(root, file)
                
                # Check if filename matches any pattern
                if any(pattern in file.lower() for pattern in patterns):
                    potential_scrapers.append(file_path)
                else:
                    # Check file contents for key functions
                    try:
                        with open(file_path, "r", encoding="utf-8") as f:
                            content = f.read()
                            if "def assign_category" in content and "def assign_tags" in content:
                                potential_scrapers.append(file_path)
                    except Exception:
                        pass
    
    return potential_scrapers

def update_scraper(scraper_path: str, dryrun: bool = False) -> bool:
    """Update a scraper file to use ML-based classification"""
    # First analyze the scraper
    can_update, info = analyze_scraper(scraper_path)
    
    if not can_update:
        logging.error(f"Cannot update {scraper_path}: {info.get('error', 'Missing required functions')}")
        return False
    
    if info.get("already_using_ml", False):
        logging.warning(f"Scraper {scraper_path} is already using ML predictor")
        return False
    
    if dryrun:
        logging.info("DRY RUN: Would update scraper with ML predictor")
        return True
    
    # Create a backup
    backup_path = f"{scraper_path}.bak.{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    try:
        shutil.copy2(scraper_path, backup_path)
        logging.info(f"Created backup at {backup_path}")
    except Exception as e:
        logging.error(f"Failed to create backup: {e}")
        return False
    
    # Update the scraper
    try:
        predictor = MementoPredictor()
        success = predictor.update_scraper(scraper_path)
        
        if success:
            logging.info(f"Successfully updated {scraper_path}")
            return True
        else:
            logging.error(f"Failed to update {scraper_path}")
            return False
    except Exception as e:
        logging.error(f"Error updating scraper: {e}")
        
        # Try to restore from backup
        try:
            shutil.copy2(backup_path, scraper_path)
            logging.info(f"Restored from backup {backup_path}")
        except Exception as restore_error:
            logging.error(f"Failed to restore from backup: {restore_error}")
        
        return False

Pipeline Results: Generated Memento Datasets

Overview

After running the complete MEMENTO ML pipeline, we have generated rich, structured datasets of public mementos scraped from Secret NYC. These datasets represent real urban experiences, now classified and ready to be visualized as interactive markers in the MEMENTO web-app. Each dataset below corresponds to a different theme or category, providing a foundation for exploration, discovery, and engagement within the platform.

Public Mementos Generated (by Dataset)

Each card below represents a dataset of public mementos, grouped by theme. These are the actual lists used to populate the public mementos map and features in the MEMENTO web-app.

  • 🎭 Culture
  • 🏃 Escapes
  • 🍽️ Food & Drink
  • 🎯 Things to Do
  • 📰 Top News
  • 🌿 Wellness & Nature

Current Results

Setbacks and Challenges

  1. Data Quality Issues
    • Limited Training Data
      • Small dataset of user mementos
      • Imbalanced distribution across categories
      • Insufficient examples for rare activities
    • Data Consistency
      • Inconsistent text formatting
      • Varying levels of detail in descriptions
      • Mixed language usage (English + local terms)
  2. Model Limitations
    • Text Processing
      • Loss of context in short descriptions
      • Difficulty with slang and informal language
      • Challenges with location-specific terminology
    • Feature Engineering
      • Limited use of temporal features
      • Underutilization of location data
      • Basic text preprocessing pipeline
  3. Integration Challenges
    • System Integration
      • Complex integration with existing scrapers
      • Performance overhead in production
      • Real-time prediction latency
    • Scalability Issues
      • Memory constraints with large models
      • Processing time for batch predictions
      • Resource limitations in production


Improvement Strategies

  1. Data Enhancement
    • Data Collection
      • Expand training dataset
      • Balance category distribution
      • Include more diverse examples
    • Data Quality
      • Implement better text preprocessing
      • Standardize input formats
      • Add data validation steps
  2. Model Improvements
    • Architecture
      • Implement transformer-based models
      • Add ensemble methods
      • Use advanced NLP techniques
    • Training
      • Implement cross-validation
      • Add hyperparameter tuning
      • Use transfer learning
  3. Feature Engineering
    • Text Features
      • Add semantic analysis
      • Implement context-aware processing
      • Use word embeddings
    • Additional Features
      • Incorporate temporal patterns
      • Add location-based features
      • Include user behavior data
  4. System Optimization
    • Performance
      • Implement model quantization
      • Add caching mechanisms
      • Optimize prediction pipeline
    • Scalability
      • Add batch processing
      • Implement distributed computing
      • Optimize resource usage

Conclusion

While our current results show promise (81.32% category accuracy and 88.95% tag F1-score), there's significant room for improvement. The main challenges stem from data limitations and basic model architecture. However, with the proposed improvements in data quality, model architecture, and system optimization, we expect to achieve significantly better performance in the coming months.

Application of Computation and Machine Learning Methods in MEMENTO