NIPS 2017 Paper Clustering Analysis

I scraped the accepted papers from the 2017 Neural Information Processing Systems conference (NIPS) and used TF-IDF and k-means clustering to sort them by topic. The analysis produced 9 research clusters, covering areas from generative adversarial models to reinforcement learning and optimization.

Overview

I sorted 567 accepted NIPS 2017 papers by topic using natural language processing and unsupervised learning techniques. The process involved:

Data Collection: Scraped paper titles and abstracts from the NIPS 2017 conference schedule
Text Preprocessing: Concatenated titles with abstracts and applied TF-IDF vectorization
Similarity Analysis: Computed cosine similarity between document vectors
Dimensionality Reduction: Used multidimensional scaling to project the papers into 2D space for plotting
Clustering: Applied k-means algorithm to group papers into thematic clusters

Motivation

During my time at the conference, I found myself naturally sorting papers and presentations into categories like unsupervised learning (autoencoders, clustering), supervised learning (vanilla NNs, RNNs, CNNs, GANs), and reinforcement learning. I was curious to see if machine learning techniques could replicate this human categorization process.

I also noticed papers could be grouped by the type of improvement they made: requiring less data, training faster, increasing accuracy, reducing overfitting, guaranteeing convergence, etc. This motivated me to explore whether algorithmic clustering would reveal similar patterns.

Process

The analysis was broken into three main phases:

1. Data Scraping

Used a Python scraping script to extract titles and links from the NIPS website
Collected abstracts by following each paper’s individual link
Final dataset: 567 papers with complete title and abstract information
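
A minimal sketch of the link-extraction step, assuming BeautifulSoup as listed under Technical Implementation. The sample markup, the `paper-link` class, and the example.org URL are hypothetical stand-ins; the real NIPS schedule page needs its own selectors.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the schedule page; the real page differs.
SAMPLE_HTML = """
<ul>
  <li><a class="paper-link" href="/paper/1">First Paper Title</a></li>
  <li><a class="paper-link" href="/paper/2"> Second Paper Title </a></li>
  <li><a href="/about">About</a></li>
</ul>
"""


def parse_schedule(html, base_url):
    """Return one {'title', 'url'} dict per paper link found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    papers = []
    # "paper-link" is an assumed class name, not the site's actual markup.
    for link in soup.find_all("a", class_="paper-link"):
        papers.append({
            "title": link.get_text(strip=True),
            "url": urljoin(base_url, link["href"]),
        })
    return papers


papers = parse_schedule(SAMPLE_HTML, "https://example.org/schedule")
for paper in papers:
    print(paper["title"], "->", paper["url"])
```

Against a live site, the `html` argument would come from something like `requests.get(url).text`, and each collected URL would be fetched the same way to read the abstract.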

2. Data Cleaning

Tokenization and Stemming: Broke down abstracts and titles into individual word stems using NLTK’s Snowball Stemmer
Stopword Removal: Eliminated common words like “the”, “and”, “as” that don’t distinguish between topics
Custom Stopwords: Added domain-specific stopwords that appeared frequently but weren’t useful for clustering:
- General ML terms: “algorithm”, “framework”, “evaluate”, “model”, “neural”, “network”
- Common descriptors: “novel”, “efficient”, “robust”, “significant”, “demonstrate”
- Process words: “approach”, “method”, “technique”, “process”, “perform”
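
The cleaning step could look roughly like the sketch below. It assumes a simple regex tokenizer and a tiny illustrative stopword set in place of NLTK's full English list; only the custom stopwords are taken from the list above. The stopwords are stemmed too, so that they match the stemmed tokens.

```python
import re

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Small illustrative set; a stand-in for a full English stopword list.
BASIC_STOPWORDS = {"the", "and", "as", "a", "an", "of", "for", "to", "in", "we", "is"}

# Domain-specific stopwords listed in this write-up.
CUSTOM_STOPWORDS = {
    "algorithm", "framework", "evaluate", "model", "neural", "network",
    "novel", "efficient", "robust", "significant", "demonstrate",
    "approach", "method", "technique", "process", "perform",
}

# Stem the stopwords so they are compared in the same form as the tokens.
STOP_STEMS = {stemmer.stem(word) for word in BASIC_STOPWORDS | CUSTOM_STOPWORDS}


def tokenize_and_stem(text):
    """Lowercase, split into alphabetic tokens, stem, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [stemmer.stem(token) for token in tokens]
    return [stem for stem in stems if stem not in STOP_STEMS]


print(tokenize_and_stem("We demonstrate a novel algorithm for policy gradients"))
```

A function like this can be passed to scikit-learn's `TfidfVectorizer` through its `tokenizer` argument, so that vectorization sees stems rather than raw words.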

3. Clustering and Visualization

TF-IDF Vectorization: Converted the text to numerical vectors with max_df=0.8, so that terms appearing in more than 80% of the papers are ignored
K-means Clustering: Applied clustering algorithm with k=9 (determined through experimentation)
Hierarchical Clustering: Used Ward’s method to create dendrogram visualization
2D Projection: Employed multidimensional scaling for scatter plot visualization
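
The steps above can be sketched end to end on a toy corpus. This is a sketch under stated assumptions, not the project's actual script: the six documents are invented, and k=3 is used only because the toy data has three themes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_similarity

# Invented stand-ins for "title + abstract" strings.
docs = [
    "policy gradient reinforcement learning agent reward",
    "reinforcement learning policy optimization reward exploration",
    "generative adversarial image synthesis discriminator",
    "adversarial training generative image generator",
    "variational inference latent variable bayesian posterior",
    "latent variable variational bayesian inference approximation",
]

# TF-IDF vectors; terms in more than 80% of documents are dropped.
vectorizer = TfidfVectorizer(max_df=0.8)
tfidf = vectorizer.fit_transform(docs)

# k-means on the TF-IDF vectors (k=3 only for this toy corpus).
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(tfidf)

# Cosine distance matrix, cleaned up so it is symmetric with a zero diagonal.
dist = np.clip(1.0 - cosine_similarity(tfidf), 0.0, None)
dist = (dist + dist.T) / 2.0
np.fill_diagonal(dist, 0.0)

# Multidimensional scaling to 2D coordinates for a scatter plot.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)

for label, (x, y) in zip(labels, coords):
    print(f"cluster {label}: ({x:+.2f}, {y:+.2f})")
```

The `coords` array can be passed to a scatter plot and colored by `labels`.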

Challenges

Hyperparameter Tuning

Finding the optimal TF-IDF parameters required significant experimentation. The max_df parameter proved crucial:

Too low (< 0.6): Excluded important but moderately frequent terms
Too high (> 0.9): Included too many common terms, reducing cluster separation
Optimal (0.8): Balanced meaningful term inclusion with noise reduction
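
To illustrate what max_df controls, the sketch below fits scikit-learn's `TfidfVectorizer` on an invented ten-document corpus in which "learning" appears in every document and "deep" in seven of ten. The corpus and thresholds are chosen only to make the pruning visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus: "learning" is in 10/10 documents, "deep" in 7/10.
topics = ["policy", "reward", "image", "kernel", "graph",
          "latent", "convex", "bayesian", "sparse", "tensor"]
docs = [
    f"learning {'deep ' if i < 7 else ''}{topic}"
    for i, topic in enumerate(topics)
]


def vocabulary_for(max_df):
    """Return the set of terms kept by TfidfVectorizer at this max_df."""
    vectorizer = TfidfVectorizer(max_df=max_df)
    vectorizer.fit(docs)
    return set(vectorizer.vocabulary_)


for max_df in (1.0, 0.8, 0.6):
    vocab = vocabulary_for(max_df)
    print(f"max_df={max_df}: {len(vocab)} terms, "
          f"keeps 'learning': {'learning' in vocab}, keeps 'deep': {'deep' in vocab}")
```

A term is dropped when its document frequency is strictly above the threshold, so at 0.8 only "learning" disappears, while at 0.6 "deep" goes as well.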

Determining Cluster Count

Since papers often span multiple topics (e.g., “optimization methods in computer vision”), finding the right number of clusters was challenging:

Lower bound (k=3): Too broad - merged distinct research areas
Upper bound (k=15): Too granular - split related concepts unnecessarily
Optimal range (k=7-9): Balanced topic separation with meaningful groupings

I used both quantitative measures (within-cluster sum of squares) and qualitative assessment (keyword coherence, visual separation) to select k=9.
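
The quantitative half of that selection can be sketched as a loop over candidate k values, reading the within-cluster sum of squares from scikit-learn's `inertia_` attribute. The corpus here is invented and much smaller than the real one, so the range of k is illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus with three themes, two documents each.
docs = [
    "policy gradient reinforcement learning agent reward",
    "reinforcement learning policy optimization reward exploration",
    "generative adversarial image synthesis discriminator",
    "adversarial training generative image generator",
    "variational inference latent variable bayesian posterior",
    "latent variable variational bayesian inference approximation",
]
tfidf = TfidfVectorizer().fit_transform(docs)

# Within-cluster sum of squares (WCSS) for each candidate k.
inertias = {}
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tfidf)
    inertias[k] = km.inertia_

for k, wcss in inertias.items():
    print(f"k={k}: WCSS={wcss:.3f}")
```

WCSS generally shrinks as k grows, so the useful signal is where the decrease levels off, which is why the qualitative checks still matter.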

Overlapping Research Areas

Many papers intersect multiple domains. For example, a paper on “sparse convolutional neural network optimization” could belong to:

Computer vision (convolutional networks)
Optimization (gradient methods)
Sparsity (regularization techniques)

Because k-means assigns each paper to exactly one cluster, papers ended up grouped by their most dominant theme, and less prevalent topics show up only as secondary characteristics within a cluster.
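
One way to look at those secondary themes is to rank every cluster by its distance from a paper, not just the nearest one. The sketch below uses `KMeans.transform`, which returns the distance from each sample to each centroid; the documents are invented and this ranking is an illustration, not something the original analysis did.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus: vision, optimization, and sparsity themes.
train_docs = [
    "convolutional image classification features",
    "image convolutional segmentation features",
    "gradient descent convergence convex optimization",
    "convex optimization gradient convergence rate",
    "sparse regularization lasso penalty",
    "sparse lasso regularization recovery",
]
# A document that touches all three themes, mostly vision.
mixed_doc = ["sparse convolutional image features optimization"]

vectorizer = TfidfVectorizer()
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(vectorizer.fit_transform(train_docs))

x = vectorizer.transform(mixed_doc)
distances = km.transform(x)[0]   # Euclidean distance to each centroid
ranking = np.argsort(distances)  # nearest cluster first
assigned = km.predict(x)[0]      # the single hard assignment

for cluster in ranking:
    print(f"cluster {cluster}: distance {distances[cluster]:.3f}")
print("assigned to cluster", assigned)
```

The first entry in `ranking` is the hard assignment; the remaining entries indicate which other themes the paper sits close to.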

Outcomes

The multidimensional scaling projection below shows how the 567 papers cluster in 2D space based on their content similarity:

Figure: NIPS 2017 2D Clustering

2D projection of NIPS 2017 papers using multidimensional scaling. Each point represents a paper, colored by cluster assignment. The legend shows the top keywords for each of the 9 clusters.

Cluster Analysis Results

The k-means algorithm identified 9 distinct research clusters:

Approximation & Regression (81 papers)
- Keywords: approximation, linear, kernel, regression, empirical
- Focus: Statistical learning theory, kernel methods, linear models
Generative & Adversarial Models (48 papers)
- Keywords: generative, adversarial, image, label, joint
- Focus: GANs, generative modeling, image synthesis
Statistical Estimation (48 papers)
- Keywords: estimation, variables, statistics, parameters
- Focus: Bayesian methods, statistical inference, parameter estimation
Variational Inference (62 papers)
- Keywords: inference, variational, latent, dynamical, variables
- Focus: Probabilistic modeling, latent variable models, variational methods
Feature Learning & CNNs (142 papers) - Largest cluster
- Keywords: predictions, representations, features, convolutional, label
- Focus: Deep learning, feature extraction, supervised learning
Graph & Probability Theory (36 papers)
- Keywords: graphs, probability, random, optimization
- Focus: Graph neural networks, probabilistic models, random processes
Computer Vision (35 papers)
- Keywords: image, convolutional, features, input
- Focus: Image processing, computer vision applications
Reinforcement Learning (93 papers) - Second largest
- Keywords: policy, reinforcement, dynamical, gradient, optimization
- Focus: RL algorithms, policy optimization, control theory
Optimization Methods (22 papers)
- Keywords: optimization, gradient, convergence, descent, convex
- Focus: Optimization algorithms, convergence theory, gradient methods

Key Insights

Implicit Category Emergence: Terms like “supervised” and “unsupervised” didn’t appear directly in cluster keywords, but emerged implicitly through grouped concepts (e.g., “autoencoder” + “clustering” vs. “regression” + “classification”)
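Research Volume Distribution: The feature learning/CNN cluster (142 papers) and the reinforcement learning cluster (93 papers) were the two largest, together covering 235 of the 567 papers (about 41%), which suggests where much of 2017's research effort went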
Topic Associations: The algorithm successfully identified meaningful relationships:
- “Image” + “convolutional” (computer vision)
- “Policy” + “reinforcement” (RL)
- “Bayesian” + “variational” (probabilistic methods)
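
The volume observation can be checked with simple arithmetic on the cluster sizes reported above. The sketch below only restates those counts and computes each cluster's share of the 567 papers.

```python
# Cluster sizes as reported in the Cluster Analysis Results section.
cluster_sizes = {
    "Approximation & Regression": 81,
    "Generative & Adversarial Models": 48,
    "Statistical Estimation": 48,
    "Variational Inference": 62,
    "Feature Learning & CNNs": 142,
    "Graph & Probability Theory": 36,
    "Computer Vision": 35,
    "Reinforcement Learning": 93,
    "Optimization Methods": 22,
}

total = sum(cluster_sizes.values())
shares = {name: count / total for name, count in cluster_sizes.items()}

# Print clusters from largest to smallest with their share of all papers.
for name, count in sorted(cluster_sizes.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:32s} {count:4d} {shares[name]:6.1%}")

top_two_share = shares["Feature Learning & CNNs"] + shares["Reinforcement Learning"]
print(f"total papers: {total}, two largest clusters: {top_two_share:.1%}")
```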

Hierarchical Clustering Visualization

The Ward clustering dendrogram below shows the hierarchical relationships between all 567 papers, with similar papers grouped closer together and different research areas forming distinct branches:

Figure: NIPS 2017 Hierarchical Clustering

Hierarchical clustering dendrogram of NIPS 2017 papers using Ward’s method. Each paper title is shown on the left, with colored branches indicating different research clusters.
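
A minimal sketch of how such a dendrogram could be built with scipy, using invented titles and texts. It assumes Ward linkage is run directly on the TF-IDF vectors; because those rows are L2-normalized by default, Euclidean distance between them is a monotone function of cosine distance. The original project may have fed the linkage differently.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented titles and texts standing in for the real papers.
titles = ["RL paper A", "RL paper B", "GAN paper A",
          "GAN paper B", "VI paper A", "VI paper B"]
docs = [
    "policy gradient reinforcement learning agent reward",
    "reinforcement learning policy optimization reward exploration",
    "generative adversarial image synthesis discriminator",
    "adversarial training generative image generator",
    "variational inference latent variable bayesian posterior",
    "latent variable variational bayesian inference approximation",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# Ward linkage on the dense TF-IDF vectors (Euclidean distance).
Z = linkage(X, method="ward")

# no_plot=True returns the layout without drawing; 'ivl' holds the leaf labels.
tree = dendrogram(Z, labels=titles, no_plot=True)
print("leaf order:", tree["ivl"])

# Cut the tree into at most three flat clusters.
flat = fcluster(Z, t=3, criterion="maxclust")
print("flat clusters:", list(flat))
```

Dropping `no_plot=True` draws the tree with matplotlib.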

Lessons Learned

Domain Expertise Matters: Subjective evaluation of cluster quality improved significantly with ML domain knowledge. Understanding which concepts should be grouped together was crucial for parameter tuning.
Iterative Refinement: The custom stopword list required multiple iterations. Initial clustering revealed domain-specific common words that needed removal.
Visualization Importance: 2D projections and dendrograms were essential for validating cluster quality beyond quantitative metrics.
Hyperparameter Sensitivity: Small changes in TF-IDF parameters dramatically affected clustering results, highlighting the importance of careful tuning.
Research Trend Reflection: The cluster sizes point to 2017's research priorities, with the deep learning (feature learning/CNN) and reinforcement learning clusters being the two largest.

References

NIPS 2017 Conference
Brandon Rose’s Document Clustering Tutorial - Foundational methodology
TF-IDF Wikipedia
K-means Clustering
Ward’s Clustering Method

Technical Implementation

The complete analysis was implemented in Python using:

Web Scraping: Custom Python scripts with requests/BeautifulSoup
Text Processing: NLTK for tokenization, stemming, and stopword removal
ML Pipeline: scikit-learn for TF-IDF vectorization and k-means clustering
Visualization: matplotlib and mpld3 for interactive plots, scipy for dendrograms

The project demonstrates how unsupervised learning can reveal latent structure in academic literature, providing insights into research trends and topic relationships within a scientific conference.