Exploring Urban Data with Machine Learning
by Jonathan E. Stiles in Spring 2025
Computation Design Practices 2024-2025
Graduate School of Architecture, Planning and Preservation, Columbia University
MEMENTO is a real-time platform that captures urban experiences as they happen across the city. It aims to keep people from drifting into oblivion (a state of being unaware of what is happening around them) and instead transforms those moments into engaging odysseys (long, eventful, adventurous journeys).
Unlike conventional platforms, MEMENTO invites users to look beyond the map — to notice, witness, interact, engage, record, react, and reflect on urban experiences as physical mementos, rather than merely experiencing them through virtual screens.
It empowers users to discover, capture, and engage with the mementos around them during their commutes, turning everyday journeys into moments of exploration, creation, and interaction.
MEMENTO serves as an interaction, intersection, and interplay between the city, its people, and their experiences — a platform:
By the people and the city,
For the people and the city,
Of the people and the city.
The Machine Learning pipeline for public mementos in MEMENTO is designed to automate the classification and tagging of public content sourced from platforms such as Secret NYC, Reddit, NYC Bucket List, Newsbreak, and more. The objective is to replace rule-based keyword matching with a robust multi-label classification model that leverages existing user-generated mementos as training data.
The pipeline consists of three primary models (Category Classification, Tag Prediction, and Duration Estimation), each trained with supervised learning. Logistic Regression with a One-vs-Rest strategy and Random Forest classifiers predict the relevant category and multiple tags, while Decision Trees handle ordinal duration estimation. Text data is preprocessed with tokenization, stop-word removal, and TF-IDF vectorization to convert descriptions into feature vectors suitable for model training.
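As a quick orientation before the step-by-step walkthrough, the sketch below wires up these three estimator types over TF-IDF features on a toy example; the texts and labels here are made up (the second tag and category ids do not come from the taxonomy files), and the real feature extraction and training happen in the pipeline steps described later.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

# Toy training data (hypothetical): one single-label category, multi-label tags,
# and an ordinal duration bucket per memento description.
texts = ["hidden garden behind the library", "street musicians at the subway stop"]
categories = ["urban-nature", "performance"]
tag_lists = [["unmapped", "ephemeral"], ["live-music"]]
durations = ["1hr", "15min"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(texts)                      # descriptions -> TF-IDF vectors
Y_tags = MultiLabelBinarizer().fit_transform(tag_lists)  # tag lists -> 0/1 indicator matrix

category_model = RandomForestClassifier(n_estimators=200).fit(X, categories)
tags_model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y_tags)
duration_model = DecisionTreeClassifier(max_depth=5).fit(X, durations)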
By implementing this ML pipeline, MEMENTO enhances the consistency and accuracy of content classification, enabling the platform to transform unstructured public content into structured, contextually enriched mementos that align with its existing taxonomy.
MEMENTO is not just a platform — it's a computational ecosystem that leverages a wide spectrum of tools, methods, and data-driven processes to transform everyday urban experiences into interactive, real-time narratives. It integrates multiple computational layers, each playing a distinct role in capturing, analyzing, and visualizing the city's overlooked moments.
MEMENTO's computational framework is divided into two core sections:
The front end: the user-facing web-app interface that brings urban experiences to life.
The back end: the computational backbone that processes, stores, and structures the platform's data.
Together, these front-end and back-end components create a cohesive computational ecosystem that transforms urban experiences into interactive, real-time narratives, making the city's overlooked moments discoverable and engaging.
MEMENTO's workflow is structured around the creation, visualization, exploration, and curation of urban experiences, transforming scattered moments into structured datasets. The process includes both user-generated and public mementos, creating a dynamic, interactive map of the city's fleeting encounters.
A comprehensive visualization of the MEMENTO platform's computational workflow, illustrating the flow of memento datasets from generation to visualization to exploration, forming the core interaction model of the platform.
The MEMENTO ML pipeline is designed as a modular, five-step system that transforms raw public content into well-categorized, ML-enhanced mementos. Each step in the pipeline serves a specific purpose and contributes to the overall goal of automated content classification and enrichment.
1. Transform raw user mementos into standardized datasets
2. Prepare data for model training
3. Train specialized classification models
4. Collect and process public content
5. Apply models to classify content
The initial step focuses on transforming raw user mementos into a standardized dataset suitable for machine learning. This step includes:
This step defines the machine learning model architectures used for classifying mementos. It sets up three specialized models for different aspects of memento classification: categories, tags, and duration estimation.
The pipeline implements three specialized models for different classification tasks: a single-label category classifier, a multi-label tag predictor, and an ordinal duration estimator.
This step collects and processes public content:
The final step applies the trained models to classify scraped content:
This section outlines the installation and setup requirements for running the MEMENTO ML pipeline. Follow these steps to set up your development environment and install all necessary dependencies.
# 1. Clone the repository
git clone https://github.com/0209vaibhav/Machine-Learning_MEMENTO.git
cd Machine-Learning_MEMENTO
# 2. Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install required packages
pip install -r requirements.txt
# 4. Set up Firebase credentials
# Place your Firebase credentials JSON file in:
# ml_pipeline/input/step1_data_loader/firebase_credentials.json
# Core ML and Data Processing
scikit-learn>=1.0.2
pandas>=1.3.0
numpy>=1.21.0
nltk>=3.6.0
# Web Scraping and Data Collection
beautifulsoup4>=4.9.3
requests>=2.26.0
selenium>=4.0.0
# Firebase Integration
firebase-admin>=5.0.0
# Text Processing
spacy>=3.1.0
langdetect>=1.0.9
# Utilities
python-dotenv>=0.19.0
tqdm>=4.62.0
joblib>=1.0.0
The pipeline requires the following configuration files to be set up:
{
"type": "service_account",
"project_id": "your-project-id",
"private_key_id": "your-key-id",
"private_key": "your-private-key",
"client_email": "your-client-email",
"client_id": "your-client-id",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "your-cert-url"
}
# .env file
FIREBASE_CREDENTIALS_PATH=ml_pipeline/input/step1_data_loader/firebase_credentials.json
MODEL_OUTPUT_DIR=ml_pipeline/output
SCRAPING_INTERVAL=3600 # in seconds
CONFIDENCE_THRESHOLD=0.4
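For reference, a minimal sketch of how these settings could be read at runtime with python-dotenv; the variable names match the example .env above, while the fallback defaults and the assumption that python-dotenv drops the inline "# in seconds" comment are mine.

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

credentials_path = os.getenv("FIREBASE_CREDENTIALS_PATH")
model_output_dir = os.getenv("MODEL_OUTPUT_DIR", "ml_pipeline/output")
scraping_interval = int(os.getenv("SCRAPING_INTERVAL", "3600"))         # seconds between scrapes
confidence_threshold = float(os.getenv("CONFIDENCE_THRESHOLD", "0.4"))  # minimum prediction confidence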
After installation, verify your setup by running:
# 1. Verify Python version
python --version # Should be 3.8 or higher
# 2. Verify package installation
pip list # Should show all required packages
# 3. Verify Firebase connection
python -c "from firebase_admin import credentials, initialize_app; initialize_app(credentials.Certificate('ml_pipeline/input/step1_data_loader/firebase_credentials.json'))"
# 4. Run test script
python ml_pipeline/tests/test_setup.py
# If package installation fails, upgrade pip and clear the cache, then retry step 3
python -m pip install --upgrade pip
pip cache purge
This step is responsible for loading user mementos from Firebase and preparing them for ML training. It serves as the foundation for our machine learning pipeline by transforming raw user data into a structured format suitable for model training.
The main objectives of this step are:
The implementation uses the following key components:
The step requires the following inputs:
{
"type": "service_account",
"project_id": "your-project-id",
"private_key_id": "your-key-id",
"private_key": "your-private-key",
"client_email": "your-client-email",
"client_id": "your-client-id",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "your-cert-url"
}
[
{
"id": "architecture",
"name": "Architecture",
"symbol": "🏛️",
"keywords": ["building", "structure", "design", "architecture", "landmark", "historic", "monument", "architectural"]
},
{
"id": "urban-nature",
"name": "Urban Nature",
"symbol": "🌿",
"keywords": ["park", "garden", "nature", "green space", "plants", "trees", "urban nature", "outdoor"]
},
// ... more categories ...
]
[
{
"id": "ephemeral",
"name": "Ephemeral",
"symbol": "🌀",
"keywords": ["temporary", "fleeting", "short-lived", "momentary", "brief", "passing", "transient", "ephemeral"]
},
{
"id": "unmapped",
"name": "Unmapped",
"symbol": "📍",
"keywords": ["hidden", "undiscovered", "secret", "unknown", "unexplored", "off-map", "unmapped", "new"]
},
// ... more tags ...
]
[
{
"id": "15min",
"name": "15 Minutes",
"symbol": "⚡",
"keywords": ["15 minutes", "quarter hour", "quick stop", "brief moment", "passing", "fleeting"]
},
{
"id": "1hr",
"name": "1 Hour",
"symbol": "⏱️",
"keywords": ["1 hour", "one hour", "hour long", "60 minutes", "quick visit"]
},
// ... more durations ...
]
The step produces the following outputs:
"""
Step 1: Data Loading and Preparation
This module is responsible for loading user mementos from Firebase for ML training.
These user mementos will be used as training data for the ML models.
"""
import json
import os
import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple
import logging
from sklearn.preprocessing import MultiLabelBinarizer
import firebase_admin
from firebase_admin import credentials, firestore
from datetime import datetime
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
class MementoDataLoader:
def __init__(self,
categories_path: str = None,
tags_path: str = None,
durations_path: str = None,
firebase_credentials_path: str = None):
"""Initialize the data loader"""
self.categories = {}
self.tags = {}
self.durations = {}
self.db = None
self.lemmatizer = WordNetLemmatizer()
self.stop_words = set(stopwords.words('english'))
# Load metadata if paths provided
if categories_path:
self.load_categories(categories_path)
if tags_path:
self.load_tags(tags_path)
if durations_path:
self.load_durations(durations_path)
# Initialize Firebase if credentials provided
if firebase_credentials_path:
self.firebase_credentials_path = firebase_credentials_path
self._initialize_firebase()
def _preprocess_text(self, text: str) -> str:
"""Preprocess text for ML"""
if not isinstance(text, str):
text = str(text)
text = text.lower()
text = re.sub(r'[^a-z\s]', ' ', text)
text = ' '.join(text.split())
return text
def _extract_text_for_ml(self, memento: Dict) -> str:
"""Extract and preprocess text for ML from memento"""
text_parts = []
name = memento.get('name', '')
text_parts.extend([name] * 3)
description = memento.get('description', '')
text_parts.append(description)
location = memento.get('location', {})
if isinstance(location, dict):
location_str = str(location)
text_parts.append(location_str)
combined_text = ' '.join(filter(None, text_parts))
return self._preprocess_text(combined_text)
def prepare_training_data(self, df: pd.DataFrame, output_dir: str) -> Dict[str, str]:
"""Prepare data for training"""
os.makedirs(output_dir, exist_ok=True)
processed_data = []
for _, memento in df.iterrows():
memento_dict = memento.to_dict()
text_for_ml = self._extract_text_for_ml(memento_dict)
memento_dict['text_for_ml'] = text_for_ml
processed_data.append(memento_dict)
processed_df = pd.DataFrame(processed_data)
output_path = os.path.join(output_dir, 'user_mementos_processed.csv')
processed_df.to_csv(output_path, index=False)
return {'processed_data': output_path}
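A hypothetical usage sketch of the excerpt above: no Firebase credentials or metadata paths are passed, so only the text-preparation path is exercised, and two stand-in records replace the Firestore query that the full module performs.

import nltk
import pandas as pd

nltk.download("stopwords")   # needed once for the stop-word set built in __init__
nltk.download("wordnet")     # needed by WordNetLemmatizer

loader = MementoDataLoader()

# Stand-in records; in production these rows come from the Firestore mementos collection.
df = pd.DataFrame([
    {"name": "Pop-up jazz set", "description": "Brass band playing under the bridge at dusk.",
     "location": {"lat": 40.7, "lng": -73.9}},
    {"name": "Community garden", "description": "Tomatoes growing between two brownstones.",
     "location": {}},
])

paths = loader.prepare_training_data(
    df, output_dir="ml_pipeline/output/step1_data_processing/processed_data"
)
print(paths["processed_data"])  # CSV with an added text_for_ml column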
This step defines the machine learning model architectures used for classifying mementos. It sets up three specialized models for different aspects of memento classification: categories, tags, and duration estimation.
The main objectives of this step are:
The implementation includes three main models: a category classifier, a multi-label tag predictor, and a duration estimator.
The step requires the following inputs:
# Example input data structure
{
"text_for_ml": "processed text content",
"category": "category_id",
"tags": ["tag1", "tag2"],
"duration": "duration_value"
}
The step produces the following outputs:
"""
Step 2: Model Architecture Definition
This module defines the ML model architectures used for classifying mementos.
These models will be trained in Step 3 using user mementos from Firebase.
"""
import os
import json
import pickle
import numpy as np
import pandas as pd
import logging
from typing import Dict, List, Tuple, Optional
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
class MementoModelTrainer:
def __init__(self,
categories_path: str,
tags_path: str,
durations_path: str,
output_dir: str = "."):
"""Initialize the model trainer"""
self.categories_path = categories_path
self.tags_path = tags_path
self.durations_path = durations_path
self.output_dir = output_dir
# Create output directory
os.makedirs(output_dir, exist_ok=True)
# Load metadata
self.categories, self.category_ids = self._load_categories()
self.tags, self.tag_ids = self._load_tags()
self.durations, self.duration_ids = self._load_durations()
# Initialize models
self.vectorizer = None
self.category_model = None
self.tags_model = None
self.duration_model = None
def train_models(self,
data_path: str,
test_size: float = 0.2,
random_state: int = 42,
use_grid_search: bool = False) -> Dict:
"""Train category, tag, and duration models"""
# Load and prepare data
X, y_category, y_tags, y_duration = self.load_data(data_path)
# Create text vectorizer
self.vectorizer = TfidfVectorizer(
max_features=5000,
min_df=2,
max_df=0.8,
ngram_range=(1, 2),
stop_words='english'
)
# Train models
self._train_category_model(X, y_category)
self._train_tags_model(X, y_tags)
self._train_duration_model(X, y_duration)
# Evaluate and save models
metrics = self._evaluate_models(X, y_category, y_tags, y_duration)
self._save_models()
self.save_model_info(metrics)
return metrics
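The tag-training helper called above is not reproduced in this excerpt; the standalone sketch below shows one plausible shape for it, assuming a MultiLabelBinarizer over the tag lists and a One-vs-Rest logistic regression on TF-IDF features.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

def train_tags_model(texts, tag_lists):
    """Fit a multi-label tag classifier; returns (vectorizer, binarizer, model)."""
    vectorizer = TfidfVectorizer(max_features=5000, min_df=2, max_df=0.8,
                                 ngram_range=(1, 2), stop_words="english")
    X = vectorizer.fit_transform(texts)        # descriptions -> TF-IDF features
    binarizer = MultiLabelBinarizer()
    Y = binarizer.fit_transform(tag_lists)     # lists of tag ids -> 0/1 matrix
    model = OneVsRestClassifier(
        LogisticRegression(max_iter=1000, class_weight="balanced")
    )
    model.fit(X, Y)
    return vectorizer, binarizer, model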
This step trains the machine learning models defined in Step 2 using the processed user mementos from Step 1. It includes comprehensive model training, evaluation, and validation processes.
The main objectives of this step are:
The implementation includes several key components:
The step requires the following inputs:
# Example hyperparameter grid
param_grid = {
'C': [0.1, 1, 10],
'max_iter': [1000],
'class_weight': ['balanced'],
'solver': ['liblinear', 'saga']
}
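As an illustration, the grid above could be searched with GridSearchCV roughly as follows; X_vec and y_category are assumed names for the TF-IDF matrix and category labels prepared earlier.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    LogisticRegression(),
    param_grid=param_grid,     # the hyperparameter grid shown above
    scoring="f1_macro",        # rewards balanced performance across categories
    cv=5,
    n_jobs=-1,
)
search.fit(X_vec, y_category)
best_category_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))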
The step produces the following outputs:
{
"training_date": "2024-03-15",
"model_versions": {
"category_classifier": "v1.2.0",
"tag_predictor": "v1.1.0",
"duration_estimator": "v1.0.1"
},
"training_metrics": {
"category_classification": {
"accuracy": 0.8132,
"f1_score": 0.7945,
"precision": 0.8023,
"recall": 0.7869,
"training_samples": 1250,
"validation_samples": 312
},
"tag_prediction": {
"micro_f1": 0.8895,
"macro_f1": 0.8567,
"hamming_loss": 0.1105,
"training_samples": 1250,
"validation_samples": 312
},
"duration_estimation": {
"accuracy": 0.7234,
"mean_absolute_error": 0.4567,
"training_samples": 1250,
"validation_samples": 312
}
},
"hyperparameters": {
"category_classifier": {
"model_type": "RandomForestClassifier",
"n_estimators": 200,
"max_depth": 10,
"min_samples_split": 5
},
"tag_predictor": {
"model_type": "MultiLabelClassifier",
"threshold": 0.3,
"base_estimator": "LinearSVC"
},
"duration_estimator": {
"model_type": "DecisionTreeClassifier",
"max_depth": 5,
"min_samples_leaf": 10
}
},
"training_time": {
"data_preparation": "45.2s",
"model_training": "189.7s",
"validation": "23.4s",
"total": "258.3s"
},
"validation_notes": [
"Category classification shows strong performance on major categories",
"Tag prediction achieves high micro-F1 score indicating good overall accuracy",
"Duration estimation needs improvement, particularly for longer durations",
"Model size optimized for production deployment"
]
}
01_vectorizer.pkl
01_category_model.pkl
02_category_model.pkl
02_tags_model.pkl
02_duration_model.pkl
"""
Step 3: Model Training
This module trains the ML models defined in Step 2 using user mementos from Step 1.
The trained models will be used to classify scraped data in Step 5.
"""
import os
import json
import logging
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple, Optional
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.preprocessing import MultiLabelBinarizer
import pickle
from datetime import datetime
class MementoModelTrainer:
def __init__(self,
input_dir: str = "ml_pipeline/output/step1_data_processing/processed_data",
output_dir: str = "ml_pipeline/output/step3_model_training"):
"""Initialize the trainer"""
self.input_dir = input_dir
self.output_dir = output_dir
self.models_dir = os.path.join(output_dir, "models")
self.metrics_dir = os.path.join(output_dir, "metrics")
self.reports_dir = os.path.join(output_dir, "reports")
# Create output directories
os.makedirs(self.models_dir, exist_ok=True)
os.makedirs(self.metrics_dir, exist_ok=True)
os.makedirs(self.reports_dir, exist_ok=True)
# Define hyperparameter grids
self.param_grid = {
'C': [0.1, 1, 10],
'max_iter': [1000],
'class_weight': ['balanced'],
'solver': ['liblinear', 'saga']
}
def train_models(self):
"""Train all models"""
# Load and prepare data
df = self.load_training_data()
X, y_dict = self.prepare_training_data(df)
# Train models
category_metrics = self.train_category_model(X, y_dict['category'])
tags_metrics = self.train_tags_model(X, y_dict['tags'])
duration_metrics = self.train_duration_model(X, y_dict['duration'])
# Save vectorizer
self.save_vectorizer()
# Generate and save reports
metrics = {
'category': category_metrics,
'tags': tags_metrics,
'duration': duration_metrics
}
self.generate_model_report(metrics)
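The save_vectorizer call above is not shown in this excerpt; a plausible standalone sketch, using the models directory and the 01_vectorizer.pkl filename listed in the outputs, might look like this.

import os
import pickle

def save_vectorizer(vectorizer, models_dir="ml_pipeline/output/step3_model_training/models"):
    """Persist the fitted TF-IDF vectorizer alongside the trained model files."""
    os.makedirs(models_dir, exist_ok=True)
    path = os.path.join(models_dir, "01_vectorizer.pkl")
    with open(path, "wb") as f:
        pickle.dump(vectorizer, f)
    return path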
This step scrapes data from Secret NYC to create a testing dataset for our ML models. It collects public content that will be processed and classified in subsequent steps.
The main objectives of this step are:
The implementation includes several key components:
The step requires the following inputs:
# Example configuration
base_url = "https://secretnyc.co/things-to-do/"
output_dir = "ml_pipeline/output/step4_scraped_data"
The step produces the following outputs:
"""
Step 4: Scrape Data
This module scrapes data from Secret NYC to create a testing dataset.
"""
import requests
from bs4 import BeautifulSoup
import json
import os
import re
import time
from datetime import datetime
from langdetect import detect
from typing import Dict, List, Optional
import logging
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
class SecretNYCScraper:
def __init__(self,
base_url: str = "https://secretnyc.co/things-to-do/",
output_dir: str = "ml_pipeline/output/step4_scraped_data"):
"""Initialize scraper"""
self.base_url = base_url
self.output_dir = output_dir
self.raw_data_dir = os.path.join(output_dir, "raw_data")
os.makedirs(self.raw_data_dir, exist_ok=True)
self.default_user = "Secret NYC"
def create_session(self) -> requests.Session:
"""Create session with retry strategy"""
session = requests.Session()
retry_strategy = Retry(
total=3,
backoff_factor=1,
status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)
session.headers.update({
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})
return session
def scrape_article(self, article_url: str) -> Optional[Dict]:
"""Scrape a single article"""
session = self.create_session()
try:
response = session.get(article_url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
# Extract article data
title = soup.find("h1").get_text(strip=True) if soup.find("h1") else "Untitled"
paragraphs = soup.select("section.article__body p")
desc = "\n".join(p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True))
# Process and clean data
cleaned = self.clean_description(desc)
location = self.extract_fallback_location(desc, title)
duration = self.extract_duration(desc)
# Create memento object
memento = {
"userId": self.default_user,
"location": self.geocode_location(location) if location else {},
"media": self.extract_media(soup),
"name": title,
"description": cleaned,
"category": "Other",
"timestamp": self.parse_date(self.extract_date(soup)),
"tags": ["Other"],
"link": article_url,
"mementoType": "public",
"duration": duration
}
return memento
except Exception as e:
logging.error(f"Error scraping {article_url}: {e}")
return None
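The helpers called in scrape_article (clean_description, extract_duration, geocode_location, and so on) are not reproduced here. As one example, a hedged sketch of extract_duration might map simple "N minutes/hours" phrases onto the platform's duration buckets; the regex and the "2hr+" fallback id are assumptions.

import re

def extract_duration(text: str) -> str:
    """Guess a duration bucket id (e.g. '15min', '1hr') from free text."""
    match = re.search(r"(\d+)\s*(minutes?|mins?|hours?|hrs?)", text, re.IGNORECASE)
    if not match:
        return "1hr"                     # assumed default bucket
    value, unit = int(match.group(1)), match.group(2).lower()
    minutes = value if unit.startswith("min") else value * 60
    if minutes <= 15:
        return "15min"
    if minutes <= 60:
        return "1hr"
    return "2hr+"                        # assumed bucket id for longer visits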
This step processes the scraped data from Step 4 through the trained models from Step 3. It classifies each scraped memento into categories, tags, and durations, applying quality control measures to ensure accurate predictions.
The main objectives of this step are:
The implementation includes several key components:
The step requires the following inputs:
# Example confidence thresholds
confidence_thresholds = {
"category": 0.4,
"tags": 0.3,
"duration": 0.4
}
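For illustration, a hypothetical helper applying these thresholds to one prediction could look like the following; the field names mirror the processor class shown later in this step.

def passes_thresholds(category_conf, tags_conf, duration_conf, thresholds):
    """Return per-field confidence flags plus an overall needs_review flag."""
    flags = {
        "category_ok": category_conf >= thresholds["category"],
        "tags_ok": all(conf >= thresholds["tags"] for conf in tags_conf),
        "duration_ok": duration_conf >= thresholds["duration"],
    }
    flags["needs_review"] = not all(flags.values())
    return flags

print(passes_thresholds(0.85, [0.75, 0.31], 0.42, confidence_thresholds))
# {'category_ok': True, 'tags_ok': True, 'duration_ok': True, 'needs_review': False}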
The step produces the following outputs:
ml_pipeline/output/step5_processed_data/validation/ (for validation logs and future outputs)
"""
Step 5: Process Scraped Data
This module processes scraped data from Step 4 through trained models from Step 3.
It classifies each scraped memento into categories, tags, and durations.
"""
import json
import os
import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple
import logging
import pickle
from sklearn.preprocessing import MultiLabelBinarizer
from datetime import datetime
class ScrapedDataProcessor:
def __init__(self,
model_dir: str = "ml_pipeline/output/step3_model_training/models",
output_dir: str = "ml_pipeline/output/step5_processed_data"):
"""Initialize the processor"""
self.model_dir = model_dir
self.output_dir = output_dir
self.processed_data_dir = os.path.join(output_dir, "processed_data")
self.reports_dir = os.path.join(output_dir, "reports")
self.validation_dir = os.path.join(output_dir, "validation")
# Confidence thresholds
self.confidence_thresholds = {
"category": 0.4,
"tags": 0.3,
"duration": 0.4
}
# Create output directories
os.makedirs(self.processed_data_dir, exist_ok=True)
os.makedirs(self.reports_dir, exist_ok=True)
os.makedirs(self.validation_dir, exist_ok=True)
# Load models
self._load_models()
def process_memento(self, memento: Dict) -> Dict:
"""Process a single memento"""
# Extract text for ML
text = self._extract_text_for_ml(memento)
# Vectorize text
text_vec = self.vectorizer.transform([text])
# Get predictions
category_pred, category_conf = self._predict_category(text_vec)
tags_pred, tags_conf = self._predict_tags(text_vec)
duration_pred, duration_conf = self._predict_duration(text_vec)
# Validate predictions
validation = self._validate_predictions(
category_pred, category_conf,
tags_pred, tags_conf,
duration_pred, duration_conf
)
# Update memento with predictions
memento.update({
"category": category_pred,
"category_confidence": category_conf,
"tags": tags_pred,
"tags_confidence": tags_conf,
"duration": duration_pred,
"duration_confidence": duration_conf,
"validation": validation
})
return memento
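The per-field prediction helpers used in process_memento are not shown; one plausible shape for the tag prediction, written as a standalone function, keeps every tag whose probability clears the tag threshold and otherwise falls back to the single most likely tag. The tag_ids argument is assumed to be the tag vocabulary loaded alongside the models.

import numpy as np

def predict_tags(tags_model, text_vec, tag_ids, threshold=0.3):
    """Return (tags, confidences) for predictions above the threshold."""
    probabilities = tags_model.predict_proba(text_vec)[0]
    picked = [(tag, float(p)) for tag, p in zip(tag_ids, probabilities) if p >= threshold]
    if not picked:                       # never return an empty tag list
        best = int(np.argmax(probabilities))
        picked = [(tag_ids[best], float(probabilities[best]))]
    tags, confidences = zip(*picked)
    return list(tags), list(confidences)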
This step provides a production-ready predictor class that uses trained ML models to classify new mementos with categories, tags, and durations. It's designed for seamless integration with the scraper and other components of the system.
The main objectives of this step are:
The implementation includes several key components:
The step requires the following inputs:
# Example model paths
models_dir = "ml_pipeline/models"
categories_path = "memento_categories_combined.json"
tags_path = "memento_tags_combined.json"
durations_path = "memento_durations.json"
The step produces the following outputs:
# Example prediction output
{
"category": "🌳 Outdoors",
"tags": ["🌅 Sunset", "🎵 Music", "🏞️ Parks"],
"duration": "1-2 hours",
"confidence_scores": {
"category": 0.85,
"tags": [0.75, 0.65, 0.60],
"duration": 0.80
}
}
"""
Step 6: ML Model Predictor
This module provides a production-ready predictor class that uses trained ML models
to classify new mementos with categories, tags, and durations.
"""
import os
import json
import pickle
import logging
import numpy as np
from typing import List, Dict, Optional, Union
class MementoPredictor:
def __init__(self,
models_dir: Optional[str] = None,
categories_path: Optional[str] = None,
tags_path: Optional[str] = None,
durations_path: Optional[str] = None,
threshold: float = 0.2):
"""Initialize the predictor with trained models"""
# Set up paths
self.script_dir = os.path.dirname(os.path.abspath(__file__))
self.root_dir = os.path.dirname(self.script_dir)
# Set models_dir if not provided
if models_dir is None:
models_dir = os.path.join(self.script_dir, "models")
self.models_dir = models_dir
# Set paths for category, tag, and duration definitions
self.categories_path = categories_path or os.path.join(self.root_dir, "memento_categories_combined.json")
self.tags_path = tags_path or os.path.join(self.root_dir, "memento_tags_combined.json")
self.durations_path = durations_path or os.path.join(self.root_dir, "memento_durations.json")
# Set threshold
self.threshold = threshold
# Initialize models and data
self.vectorizer = None
self.category_model = None
self.tags_model = None
self.duration_model = None
self.categories = None
self.tags = None
self.durations = None
# Load models and data
self._load_categories_and_tags()
self._load_models()
def classify_memento(self,
description: str,
context: Optional[Dict] = None) -> Dict[str, Union[str, List[str]]]:
"""Classify a memento with categories, tags, and duration"""
# Get predictions
category = self.predict_category(description)
tags = self.predict_tags(description)
duration = self.predict_duration(description)
# Return complete classification
return {
"category": category,
"tags": tags,
"duration": duration
}
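A short usage example of the predictor above; it assumes the trained model files and taxonomy JSONs are already in place, and the description is made up.

predictor = MementoPredictor(threshold=0.3)

result = predictor.classify_memento(
    "Free outdoor jazz in the park at sunset; bring a blanket and stay for an hour."
)
print(result)
# e.g. {"category": "...", "tags": ["...", "..."], "duration": "1hr"}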
This step provides tools to integrate the trained ML models with production systems, particularly focusing on updating existing scrapers to use ML-based classification. It includes robust safety features and comprehensive integration tools.
The main objectives of this step are:
The implementation includes several key components:
The step requires the following inputs:
# Example command-line usage
python step7_integration.py --scraper path/to/scraper.py
python step7_integration.py --find --search-dir path/to/search
python step7_integration.py --scraper path/to/scraper.py --dryrun
The step produces the following outputs:
# Example analysis output
{
"path": "path/to/scraper.py",
"has_assign_category": true,
"has_assign_tags": true,
"already_using_ml": false,
"imports": ["import requests", "from bs4 import BeautifulSoup"],
"size": 1024,
"lines": 50
}
"""
Step 7: ML Pipeline Integration
This module provides tools to integrate the trained ML models with production systems,
particularly focusing on updating existing scrapers to use ML-based classification.
"""
import os
import sys
import argparse
import logging
import shutil
from datetime import datetime
from typing import Tuple
# Import the predictor
from step6_predictor import MementoPredictor
def find_scrapers(search_dir: str = None) -> list:
"""Find all potential scraper files in the given directory"""
    if search_dir is None:
        script_dir = os.path.dirname(os.path.abspath(__file__))  # directory of this module
        search_dir = os.path.dirname(script_dir)
potential_scrapers = []
# Search patterns that indicate a scraper file
patterns = [
"scrape_all_pages",
"scrape_",
"_scraper",
"crawler",
]
# Walk through the directory
for root, _, files in os.walk(search_dir):
for file in files:
if file.endswith(".py"):
file_path = os.path.join(root, file)
# Check if filename matches any pattern
if any(pattern in file.lower() for pattern in patterns):
potential_scrapers.append(file_path)
else:
# Check file contents for key functions
try:
with open(file_path, "r", encoding="utf-8") as f:
content = f.read()
if "def assign_category" in content and "def assign_tags" in content:
potential_scrapers.append(file_path)
except Exception:
pass
return potential_scrapers
def update_scraper(scraper_path: str, dryrun: bool = False) -> bool:
"""Update a scraper file to use ML-based classification"""
# First analyze the scraper
can_update, info = analyze_scraper(scraper_path)
if not can_update:
logging.error(f"Cannot update {scraper_path}: {info.get('error', 'Missing required functions')}")
return False
if info.get("already_using_ml", False):
logging.warning(f"Scraper {scraper_path} is already using ML predictor")
return False
if dryrun:
logging.info("DRY RUN: Would update scraper with ML predictor")
return True
# Create a backup
backup_path = f"{scraper_path}.bak.{datetime.now().strftime('%Y%m%d_%H%M%S')}"
try:
shutil.copy2(scraper_path, backup_path)
logging.info(f"Created backup at {backup_path}")
except Exception as e:
logging.error(f"Failed to create backup: {e}")
return False
# Update the scraper
try:
predictor = MementoPredictor()
success = predictor.update_scraper(scraper_path)
if success:
logging.info(f"Successfully updated {scraper_path}")
return True
else:
logging.error(f"Failed to update {scraper_path}")
return False
except Exception as e:
logging.error(f"Error updating scraper: {e}")
# Try to restore from backup
try:
shutil.copy2(backup_path, scraper_path)
logging.info(f"Restored from backup {backup_path}")
except Exception as restore_error:
logging.error(f"Failed to restore from backup: {restore_error}")
return False
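The command-line entry point implied by the usage examples earlier in this step is not included in the excerpt; a hedged sketch wiring the --scraper, --find, --search-dir, and --dryrun flags to the functions above could look like this.

def main() -> int:
    parser = argparse.ArgumentParser(description="Integrate ML classification into existing scrapers")
    parser.add_argument("--scraper", help="Path to a scraper file to update")
    parser.add_argument("--find", action="store_true", help="Search for candidate scraper files")
    parser.add_argument("--search-dir", default=None, help="Directory to search with --find")
    parser.add_argument("--dryrun", action="store_true", help="Analyze only; do not modify files")
    args = parser.parse_args()

    if args.find:
        for path in find_scrapers(args.search_dir):
            print(path)
        return 0
    if args.scraper:
        return 0 if update_scraper(args.scraper, dryrun=args.dryrun) else 1
    parser.print_help()
    return 1

if __name__ == "__main__":
    sys.exit(main())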
After running the complete MEMENTO ML pipeline, we have generated rich, structured datasets of public mementos scraped from Secret NYC. These datasets represent real urban experiences, now classified and ready to be visualized as interactive markers in the MEMENTO web-app. Each dataset below corresponds to a different theme or category, providing a foundation for exploration, discovery, and engagement within the platform.
Each card below represents a dataset of public mementos, grouped by theme. These are the actual lists used to populate the public mementos map and features in the MEMENTO web-app.
While the current results show promise (81.32% category accuracy and an 88.95% tag micro-F1 score), there is significant room for improvement. The main challenges stem from data limitations and the relatively simple model architectures. With the proposed improvements in data quality, model architecture, and system optimization, we expect significantly better performance in the coming months.