Click to Segment: Building an Interactive Image Segmentation Demo with SAM 2.1

Upload an image, click on any object, and it gets highlighted instantly. Not science fiction — just Meta's open-source SAM 2.1 combined with full-stack engineering.

renderings

Background: What Is Image Segmentation?

Image segmentation is one of the core tasks in computer vision: given an image, determine which object each pixel belongs to.

Traditional approaches require massive amounts of labeled data and extensive model fine-tuning — a high bar to clear.

In 2023, Meta released the Segment Anything Model (SAM), which fundamentally changed the game. SAM can perform zero-shot segmentation on any image for any object, requiring only a point or bounding box as a prompt from the user.

In September 2024, Meta released SAM 2.1 — better performance with fewer parameters, new video segmentation capability, and fully open-sourced under Apache 2.0.

效果图

Why SAM 2 Instead of SAM 3?

Meta's SAM family now spans three generations. Here's how they compare:

Version	Released	Video Segmentation	Model Size	Access
SAM 1	2023.4	✗	86M ~ 641M	No approval needed
SAM 2 / 2.1	2024.7 / 2024.9	✓	39M ~ 224M	No approval needed
SAM 3	2025.11	✓	Undisclosed	Approval required

Why not SAM 3?

SAM 3 was released in November 2025, but downloading the weights requires submitting an access request on Hugging Face and waiting for approval. On top of that, SAM 3 mandates a CUDA GPU (PyTorch 2.7+, CUDA 12.6+), uses the SAM License instead of Apache 2.0, and comes with more restrictions. For a quick local demo, the barriers are simply too high.

Why not SAM 1?

SAM 1 doesn't support video segmentation, and its smallest model is still 86M — both slower and less accurate than SAM 2.1 at comparable sizes.

Why SAM 2.1:

Latest officially open release — weights download directly with wget, zero approval needed
4 size variants (39M ~ 224M); the tiny model runs on CPU
Unified architecture for image and video segmentation
Apache 2.0 — fully open source and commercially friendly

效果图

The Project: FastSAM-Demo

FastSAM-Demo is an interactive image segmentation web app built on SAM 2.1.

The interaction is as simple as it gets: upload an image → click an object → see it highlighted in real time.

Features

Feature	Description
Click-to-segment	Millisecond-level response, no waiting
Multiple model sizes	tiny(39M) / small(46M) / base+(81M) / large(224M)
CPU-friendly	The tiny model runs without a GPU
Multi-object labeling	Different colors for multiple segmented regions
Efficient transfer	RLE compression with >98% reduction in mask data size
No approval needed	Weights download directly, Apache 2.0 licensed

Tech Stack

Backend:  FastAPI + SAM 2.1 (Python / uv)
Frontend: Next.js 15 + TypeScript + Tailwind CSS v4 + Framer Motion

System Architecture

Two layers: the frontend handles interaction and rendering; the backend handles model inference.

Browser (Next.js + TypeScript)
│
│  User clicks image at (x, y)
│
▼
useSegmentation Hook
│ POST /api/segment
▼
FastAPI Server
│
├── Image Cache (reuses embedding, avoids re-encoding)
│
└── SAM 2.1 ImagePredictor
      ├── set_image()   ← computed once on upload
      └── predict()     ← called on each click, returns in milliseconds

Key Design: Embedding Cache

SAM 2.1 inference has two stages with very different latencies:

Stage	Operation	Latency
Image Encoder	`set_image()` — precompute image features	0.1s ~ 10s
Mask Decoder	`predict()` — generate mask from a point	10ms ~ 100ms

The optimization: image embedding is computed once on upload and cached. Every subsequent click only runs the lightweight Mask Decoder. This is why the interaction feels instantaneous.

Full Data Flow

1. User uploads image
 → POST /api/upload
 → Backend calls set_image(), caches embedding in memory
 → Returns image_id

2. User clicks image at (x, y)
 → Frontend maps coordinates (Canvas → original image space)
 → POST /api/segment { image_id, x, y }
 → Backend reuses cached embedding, calls predict()
 → SAM 2.1 returns masks + scores
 → RLE-compressed response sent to frontend
 → Frontend decodes and renders semi-transparent highlight on Canvas

3. More clicks → repeat step 2, masks accumulate

How SAM 2.1 Works

Architecture

SAM 2.1 is built around three core modules:

Image / Video Frame
  │
  ▼
Hiera Encoder (image encoder)
  │ image embedding
  ▼
Mask Decoder ← Prompt Encoder (accepts point / box / mask prompts)
  │
  ▼
masks + scores + logits

Hiera Encoder: a hierarchical visual encoder more efficient than SAM 1's ViT, with significantly fewer parameters
Prompt Encoder: encodes click coordinates, bounding boxes, or mask hints
Mask Decoder: fuses image features and prompts to output segmentation masks
Memory Attention: propagates segmentation across frames in video mode (not used in this image-only demo)

Model Size Comparison

Model	Parameters	Speed (A100)	Accuracy (SA-V J&F)
tiny	38.9M	91.2 FPS	76.5
small	46M	84.8 FPS	76.6
base+	80.8M	64.1 FPS	78.2
large	224.4M	39.5 FPS	79.5

For local CPU demos, tiny is the recommended choice — fast enough with acceptable quality.

Frontend Architecture

The frontend uses Next.js 15 (App Router) + TypeScript with clear separation of concerns:

types/       TypeScript type definitions (aligned with backend Pydantic schemas)
lib/         API client + RLE decoder + Canvas rendering utilities
hooks/       useSegmentation (business logic in one place)
components/  ImageUploader / SegmentCanvas / ControlPanel (pure UI)
app/         Next.js page entry points

Core principle: components own the UI; hooks own the logic.

Next.js rewrites proxy all API requests to the backend, eliminating CORS issues without any extra configuration in the frontend code.

Getting Started

# Clone the repo
git clone https://github.com/Eva-Dengyh/FastSAM-Demo.git
cd FastSAM-Demo

# One-command setup: installs dependencies, downloads model, starts both servers
./start.sh

Once running:

Frontend: http://localhost:3000
Backend API docs: http://localhost:8000/docs

Prefer manual setup?

# Backend
cd backend && uv sync
mkdir -p checkpoints
wget -O checkpoints/sam2.1_hiera_tiny.pt \
https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_tiny.pt
uv run uvicorn app.main:app --host 0.0.0.0 --port 8000

# Frontend (new terminal)
cd frontend && npm install && npm run dev

Closing Thoughts

This project brings together state-of-the-art AI with full-stack engineering practice:

SAM 2.1 delivers world-class segmentation capability, fully open source and free
The FastAPI + Next.js stack demonstrates a clean, layered architecture
Details like embedding caching and RLE compression show attention to real-world performance

If you're interested in AI + full-stack development, this is a solid starting point. The code is fully open source — Stars, forks, and issues are all welcome.

GitHub: https://github.com/Eva-Dengyh/FastSAM-Demo