2025-05-24 13:57:58 +02:00

# SEREACT - Secure Image Management API
SEREACT is a secure API for storing, organizing, and retrieving images with advanced search capabilities.
## Features
- Secure image storage in Google Cloud Storage
- Team-based organization and access control
- API key authentication
- Semantic search using image embeddings
- Metadata extraction and storage
- Image processing capabilities
- Multi-team support
- **Comprehensive E2E testing with real database support**
## Architecture
```
sereact/
├── images/  # Sample images for testing
├── deployment/  # Deployment configurations
│   ├── cloud-run/  # Google Cloud Run configuration
│   └── terraform/  # Infrastructure as code
├── docs/  # Documentation
│   ├── api/  # API documentation
│   └── TESTING.md  # Comprehensive testing guide
├── scripts/  # Utility scripts
├── src/  # Source code
│   ├── api/  # API endpoints and routers
│   │   └── v1/  # API version 1 routes
│   ├── auth/  # Authentication and authorization
│   ├── config/  # Configuration management
│   ├── core/  # Core application logic
│   ├── db/  # Database layer
│   │   ├── providers/  # Database providers (Firestore)
│   │   └── repositories/  # Data access repositories
│   ├── models/  # Database models
│   ├── schemas/  # API request/response schemas
│   ├── services/  # Business logic services
│   └── utils/  # Utility functions
├── tests/  # Test code
│   ├── api/  # API tests
│   ├── auth/  # Authentication tests
│   ├── models/  # Model tests
│   ├── services/  # Service tests
│   ├── integration/  # Integration tests
│   └── test_e2e.py  # **Comprehensive E2E workflow tests**
├── main.py  # Application entry point
├── requirements.txt  # Python dependencies
└── README.md  # This file
```
## System Architecture
```
┌─────────────┐         ┌─────────────┐         ┌─────────────┐
│             │         │             │         │             │
│   FastAPI   │ ───────▶│  Firestore  │◀────────│    Cloud    │
│   Backend   │         │  Database   │         │  Functions  │
│             │         │             │         │             │
└─────┬───────┘         └─────────────┘         └──────┬──────┘
      │                                                │
      │                                                │
      ▼                                                │
┌─────────────┐         ┌─────────────┐                │
│             │         │             │                │
│    Cloud    │         │   Pub/Sub   │                │
│   Storage   │────────▶│    Queue    │────────────────┘
│             │         │             │
└─────────────┘         └─────────────┘
┌─────────────┐         ┌─────────────┐
│             │         │             │
│    Cloud    │         │  Pinecone   │
│ Vision API  │────────▶│  Vector DB  │
│             │         │             │
└─────────────┘         └─────────────┘
```
1. **Image Upload Flow**:
   - Images are uploaded through the FastAPI backend
   - Images are stored in Cloud Storage
   - A message is published to the Pub/Sub queue
2. **Embedding Generation Flow**:
   - A Cloud Function is triggered by the Pub/Sub message
   - The function calls the Cloud Vision API to generate image embeddings
   - Embeddings are stored in the Pinecone vector database
3. **Search Flow**:
   - Search queries are processed by the FastAPI backend
   - A vector similarity search is performed against Pinecone
   - Results are combined with metadata from Firestore
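The search flow can be illustrated with a minimal in-memory sketch. It is not the project's implementation: the dictionaries `VECTOR_INDEX` and `METADATA` are hypothetical stand-ins for Pinecone and Firestore, and the function names are invented for illustration. It only shows the shape of the flow: score stored vectors against a query vector, keep the best matches, then attach metadata.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query_vector: List[float],
    vector_index: Dict[str, List[float]],
    metadata_store: Dict[str, dict],
    top_k: int = 2,
) -> List[dict]:
    """Rank stored vectors by similarity, then merge in per-image metadata."""
    scored = [
        (image_id, cosine_similarity(query_vector, vector))
        for image_id, vector in vector_index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    results = []
    for image_id, score in scored[:top_k]:
        metadata = metadata_store.get(image_id, {})
        results.append({"image_id": image_id, "score": score, **metadata})
    return results


# Hypothetical toy data: a vector store and a metadata store keyed by image ID.
VECTOR_INDEX = {"img-1": [1.0, 0.0], "img-2": [0.0, 1.0], "img-3": [0.7, 0.7]}
METADATA = {
    "img-1": {"filename": "cat.jpg"},
    "img-2": {"filename": "car.jpg"},
    "img-3": {"filename": "kitten.jpg"},
}

if __name__ == "__main__":
    for hit in search([1.0, 0.1], VECTOR_INDEX, METADATA):
        print(hit["image_id"], hit["filename"], round(hit["score"], 3))
```

In a real deployment the ranking step would be delegated to the vector database rather than computed in the API process.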
## Technology Stack
- FastAPI - Web framework
- Firestore - Database
- Google Cloud Storage - Image storage
- Google Pub/Sub - Message queue
- Google Cloud Functions - Serverless computing
- Google Cloud Vision API - Image analysis and embedding generation
- Pinecone - Vector database for semantic search
- Pydantic - Data validation
## Setup and Installation
### Prerequisites
- Python 3.8+
- Google Cloud account with Firestore, Storage, Pub/Sub, and Cloud Functions enabled
- Pinecone account for vector database
### Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/sereact.git
cd sereact
```
2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # Linux/macOS
# On Windows, run this instead: venv\Scripts\activate
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Create a `.env` file with the following environment variables:
```
# Firestore
FIRESTORE_PROJECT_ID=your-gcp-project-id
FIRESTORE_CREDENTIALS_FILE=path/to/firestore-credentials.json
# Google Cloud Storage
GCS_BUCKET_NAME=your-bucket-name
GCS_CREDENTIALS_FILE=path/to/credentials.json
# Google Pub/Sub
PUBSUB_TOPIC=image-processing-topic
# Google Cloud Vision
VISION_API_ENABLED=true
# Security
API_KEY_SECRET=your-secret-key
# Vector database
VECTOR_DB_API_KEY=your-pinecone-api-key
VECTOR_DB_ENVIRONMENT=your-pinecone-environment
VECTOR_DB_INDEX_NAME=image-embeddings
```
5. Run the application:
```bash
uvicorn main:app --reload
```
6. Visit `http://localhost:8000/docs` in your browser to access the API documentation.
## API Endpoints
The API provides the following main endpoints:
- `/api/v1/auth/*` - Authentication and API key management
- `/api/v1/teams/*` - Team management
- `/api/v1/users/*` - User management
- `/api/v1/images/*` - Image upload, download, and management
- `/api/v1/search/*` - Image search functionality
Refer to the Swagger UI documentation at `/docs` for detailed endpoint information.
## Development
### Running Tests
```bash
pytest
```
### **Comprehensive End-to-End Testing**
SEREACT includes a comprehensive E2E testing suite that covers complete user workflows with **completely self-contained artificial test data**:
```bash
# Run all E2E tests (completely self-contained - no setup required!)
python scripts/run_tests.py e2e
# Run unit tests only (fast)
python scripts/run_tests.py unit
# Run integration tests (requires real database)
python scripts/run_tests.py integration
# Run all tests
python scripts/run_tests.py all
# Run with coverage report
python scripts/run_tests.py coverage
```
#### **E2E Test Coverage**
Our comprehensive E2E tests cover:
**Core Functionality:**
- ✅ **Bootstrap Setup**: Automatic creation of isolated test environment with artificial data
- ✅ **Authentication**: API key validation and verification
- ✅ **Team Management**: Create, read, update, delete teams
- ✅ **User Management**: Create, read, update, delete users
- ✅ **API Key Management**: Create, list, revoke API keys
**Image Operations:**
- ✅ **Image Upload**: File upload with metadata
- ✅ **Image Retrieval**: Get image details and download
- ✅ **Image Updates**: Modify descriptions and tags
- ✅ **Image Listing**: Paginated image lists with filters
**Advanced Search Functionality:**
- ✅ **Text Search**: Search by description content
- ✅ **Tag Search**: Filter by tags
- ✅ **Advanced Search**: Combined filters and thresholds
- ✅ **Similarity Search**: Find similar images using embeddings
- ✅ **Search Performance**: Response time validation
**Security and Isolation:**
- ✅ **User Roles**: Admin vs regular user permissions
- ✅ **Multi-team Isolation**: Data privacy between teams
- ✅ **Access Control**: Unauthorized access prevention
- ✅ **Error Handling**: Graceful error responses
**Performance and Scalability:**
- ✅ **Bulk Operations**: Multiple image uploads
- ✅ **Concurrent Access**: Simultaneous user operations
- ✅ **Database Performance**: Query response times
- ✅ **Data Consistency**: Transaction integrity
#### **Test Features**
**🎯 Completely Self-Contained**
- **No setup required**: Tests create their own isolated environment
- **Artificial test data**: Each test class creates unique teams, users, and images
- **Automatic cleanup**: All test data is deleted after tests complete
- **No environment variables needed**: Just run the tests!
**🔒 Isolated and Safe**
- **Unique identifiers**: Each test uses timestamp-based unique names
- **No conflicts**: Tests can run in parallel without interference
- **No database pollution**: Tests don't affect existing data
- **Idempotent**: Can be run multiple times safely
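One way to get timestamp-based unique names and ordered cleanup is sketched below. The helper `unique_name` and the `ResourceTracker` class are hypothetical and are not taken from the test suite; the random suffix is an added assumption to keep names distinct when two are generated within the same millisecond.

```python
import time
import uuid
from typing import Callable, List, Tuple


def unique_name(prefix: str) -> str:
    """Build a name from a millisecond timestamp plus a short random suffix."""
    timestamp_ms = int(time.time() * 1000)
    suffix = uuid.uuid4().hex[:8]
    return f"{prefix}-{timestamp_ms}-{suffix}"


class ResourceTracker:
    """Remembers created test resources so they can be deleted in reverse order."""

    def __init__(self) -> None:
        self._created: List[Tuple[str, str]] = []

    def register(self, kind: str, name: str) -> str:
        self._created.append((kind, name))
        return name

    def cleanup(self, delete: Callable[[str, str], None]) -> None:
        # Delete the newest resources first (images before users before teams).
        for kind, name in reversed(self._created):
            delete(kind, name)
        self._created.clear()


if __name__ == "__main__":
    tracker = ResourceTracker()
    tracker.register("team", unique_name("team"))
    tracker.register("user", unique_name("user"))
    tracker.cleanup(lambda kind, name: print("deleting", kind, name))
```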
**⚡ Performance-Aware**
- **Class-scoped fixtures**: Expensive setup shared across test methods
- **Efficient cleanup**: Resources deleted in optimal order
- **Real database tests**: Optional performance testing with larger datasets
- **Timing validation**: Response time assertions for critical operations
#### **Advanced Test Modes**
**Standard E2E Tests (No Setup Required)**
```bash
# Just run them - completely self-contained!
python scripts/run_tests.py e2e
```
**Integration Tests with Real Services**
```bash
# Enable integration tests with real Google Cloud services
export E2E_INTEGRATION_TEST=1
pytest -m integration
```
**Real Database Performance Tests**
```bash
# Enable real database tests with larger datasets
export E2E_REALDB_TEST=1
pytest -m realdb
```
For detailed testing information, see [docs/TESTING.md](docs/TESTING.md).
### Creating a New API Version
1. Create a new package under `src/api/` (e.g., `v2`)
2. Implement new endpoints
3. Update `main.py` to include the new routers
## Deployment
### Google Cloud Run
1. Build the Docker image:
```bash
docker build -t gcr.io/your-project/sereact .
```
2. Push to Google Container Registry:
```bash
docker push gcr.io/your-project/sereact
```
3. Deploy to Cloud Run:
```bash
gcloud run deploy sereact --image gcr.io/your-project/sereact --platform managed
```
## Local Development with Docker Compose
To run the application locally using Docker Compose:
1. Make sure you have Docker and Docker Compose installed
2. Run the following command in the project root:
```bash
docker compose up
```
This will:
- Build the API container based on the Dockerfile
- Mount your local codebase into the container for live reloading
- Mount your Firestore credentials for authentication
- Expose the API on http://localhost:8000
To stop the containers:
```bash
docker compose down
```
To rebuild containers after making changes to the Dockerfile or requirements:
```bash
docker compose up --build
```
## Additional Information
## Design Decisions
### Database Selection: Firestore
- **Document-oriented model**: Ideal for hierarchical team/user/image data structures with flexible schemas
- **Real-time capabilities**: Enables live updates for collaborative features
- **Automatic scaling**: Handles variable workloads without manual intervention
- **ACID transactions**: Ensures data integrity for critical operations
- **Security rules**: Granular access control at the document level
- **Seamless GCP integration**: Works well with other Google Cloud services
### Storage Solution: Google Cloud Storage
- **Object storage optimized for binary data**: Perfect for image files of varying sizes
- **Content-delivery capabilities**: Fast global access to images
- **Lifecycle management**: Automated rules for moving less-accessed images to cheaper storage tiers
- **Fine-grained access control**: Secure pre-signed URLs for temporary access
- **Versioning support**: Maintains image history when needed
- **Cost-effective**: Pay only for what you use with no minimum fees
### Decoupled Embedding Generation
We deliberately decoupled the image embedding process from the upload flow for several reasons:
1. **Upload responsiveness**: Users experience fast upload times since compute-intensive embedding generation happens asynchronously
2. **System resilience**: Upload service remains available even if embedding service experiences issues
3. **Independent scaling**: Each component can scale based on its specific resource needs
4. **Cost optimization**: Cloud Functions only run when needed, avoiding idle compute costs
5. **Processing flexibility**: Can modify embedding algorithms without affecting the core upload flow
6. **Batch processing**: Potential to batch embedding generation for further cost optimization
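The decoupling can be sketched with an in-memory queue standing in for Pub/Sub. Everything here is illustrative: `handle_upload`, `process_pending`, and `fake_embedding` are hypothetical names, and the dictionaries replace Firestore and the vector database. The point is that the upload path only records the image and enqueues a message, while a separate worker produces the embedding later.

```python
import queue
from typing import Dict, List

# Hypothetical in-memory stand-ins for Pub/Sub, Firestore status fields, and the vector DB.
message_queue: "queue.Queue[dict]" = queue.Queue()
image_status: Dict[str, str] = {}
embeddings: Dict[str, List[float]] = {}


def handle_upload(image_id: str, data: bytes) -> dict:
    """Upload path: mark the image as processing, enqueue a message, return at once."""
    image_status[image_id] = "processing"
    message_queue.put({"image_id": image_id, "size": len(data)})
    return {"image_id": image_id, "status": "processing"}


def fake_embedding(message: dict) -> List[float]:
    """Placeholder for the real embedding call; returns a dummy 2-D vector."""
    return [float(message["size"]), 0.0]


def process_pending() -> int:
    """Worker path: drain the queue and store one embedding per message."""
    processed = 0
    while True:
        try:
            message = message_queue.get_nowait()
        except queue.Empty:
            break
        image_id = message["image_id"]
        try:
            embeddings[image_id] = fake_embedding(message)
            image_status[image_id] = "ready"
        except Exception:
            # A failed embedding must not affect the upload service.
            image_status[image_id] = "error"
        processed += 1
    return processed


if __name__ == "__main__":
    print(handle_upload("img-1", b"abc"))
    print(process_pending(), image_status)
```

The `processing`, `ready`, and `error` values mirror the image `status` field described in the database structure.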
### Latency Considerations
- **API response times**: FastAPI provides high-performance request handling
- **Caching strategy**: Frequently accessed images and search results are cached
- **Regional deployment**: Cloud Run services are deployed in a region close to users, which reduces network latency
- **Async processing**: Non-blocking operations for concurrent request handling
- **Embedding pre-computation**: All embeddings are generated ahead of time, making searches fast
- **Search optimization**: Vector database indices are optimized for quick similarity searches
### Cost Optimization
- **Serverless architecture**: Pay-per-use model eliminates idle infrastructure costs
- **Storage tiering**: Automatic migration of older images to cheaper storage classes
- **Compute efficiency**: Cloud Functions minimize compute costs through precise scaling
- **Caching**: Reduces repeated processing of the same data
- **Resource throttling**: Rate limits prevent unexpected usage spikes
- **Embedding dimensions**: Balancing vector size for accuracy vs. storage costs
- **Query optimization**: Efficient search patterns to minimize vector database operations
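This document does not say how rate limits are implemented. A token bucket is one common technique and is shown here purely as an illustration; the `TokenBucket` class and its parameters are hypothetical. The optional `now` argument exists only so the behaviour can be demonstrated deterministically.

```python
import time
from typing import Optional


class TokenBucket:
    """Allows bursts up to `capacity` requests, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_second: float,
                 now: Optional[float] = None) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and consume a token if the request may proceed."""
        current = time.monotonic() if now is None else now
        elapsed = max(0.0, current - self.updated_at)
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = current
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=2, refill_per_second=1.0)
    print([bucket.allow() for _ in range(3)])
```

One bucket per API key would keep a noisy client from consuming another team's budget.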
### Scalability Approach
- **Horizontal scaling**: All components can scale out rather than up
- **Stateless design**: API servers maintain no local state, enabling easy replication
- **Queue-based workload distribution**: Prevents system overload during traffic spikes
- **Database sharding capability**: Firestore automatically shards data for growth
- **Vector database partitioning**: Pinecone handles distributed vector search at scale
- **Load balancing**: Traffic distributed across multiple service instances
- **Microservice architecture**: Individual components can scale independently based on demand
### Security Architecture
- **API key authentication**: Simple but effective access control for machine-to-machine communication
- **Team-based permissions**: Multi-tenant isolation with hierarchical access controls
- **Encrypted storage**: All data encrypted at rest and in transit
- **Secret management**: Sensitive configuration isolated from application code
- **Minimal attack surface**: Limited public endpoints with appropriate rate limiting
- **Audit logging**: Comprehensive activity tracking for security analysis
### API Key Authentication System
SEREACT uses a simple API key authentication system:
#### Key Generation and Storage
- API keys are generated as cryptographically secure random strings
- Each team can have multiple API keys
- Keys are never stored in plaintext - only secure hashes are saved to the database
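A minimal sketch of key generation and hashed storage, using only the Python standard library. The choice of SHA-256 and the 32-byte key length are assumptions; the text above only says that keys are random and that secure hashes are stored. The function names are hypothetical.

```python
import hashlib
import secrets
from typing import Dict, Tuple


def generate_api_key() -> str:
    """Create a URL-safe random key from 32 bytes of OS-provided randomness."""
    return secrets.token_urlsafe(32)


def hash_api_key(api_key: str) -> str:
    """Hash the key for storage (SHA-256 is an assumption, not confirmed by the project)."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def create_key_record(name: str) -> Tuple[str, Dict[str, str]]:
    """Return the raw key (to show the caller once) and the record to persist."""
    raw_key = generate_api_key()
    record = {"name": name, "key_hash": hash_api_key(raw_key)}
    return raw_key, record


if __name__ == "__main__":
    raw, stored = create_key_record("ci-pipeline")
    print("show once:", raw)
    print("persist:", stored)
```

Because the API key is already high-entropy random data, a fast hash is generally considered acceptable here, unlike for user-chosen passwords.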
#### Authentication Flow
1. Client includes the API key in requests via the `X-API-Key` HTTP header
2. Auth middleware validates the key by:
   - Hashing the provided key
   - Querying Firestore for a matching hash
   - Verifying the key belongs to the appropriate team
3. Request is either authorized to proceed or rejected with 401/403 status
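The three steps above can be sketched as a plain function. This is an illustration under stated assumptions, not the project's middleware: `KEY_STORE` is a hypothetical dictionary standing in for the Firestore lookup, SHA-256 is an assumed hash, and the mapping of failures to status codes (401 for a missing or unknown key, 403 for a key that belongs to another team) is a conventional choice rather than something this document specifies.

```python
import hashlib
from typing import Dict, Optional, Tuple


def hash_api_key(api_key: str) -> str:
    """Hash a key the same way it was hashed when stored (SHA-256 assumed)."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


# Hypothetical store mapping key hash -> owning team ID (stand-in for Firestore).
KEY_STORE: Dict[str, str] = {hash_api_key("demo-key-team-a"): "team-a"}


def authenticate(headers: Dict[str, str], requested_team_id: str) -> Tuple[int, Optional[str]]:
    """Return an HTTP-style status code and the team ID resolved from the key."""
    api_key = headers.get("X-API-Key")
    if not api_key:
        return 401, None  # no credentials supplied
    team_id = KEY_STORE.get(hash_api_key(api_key))
    if team_id is None:
        return 401, None  # key hash not found
    if team_id != requested_team_id:
        return 403, team_id  # valid key, but for a different team
    return 200, team_id


if __name__ == "__main__":
    print(authenticate({"X-API-Key": "demo-key-team-a"}, "team-a"))
    print(authenticate({"X-API-Key": "demo-key-team-a"}, "team-b"))
    print(authenticate({}, "team-a"))
```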
#### Key Management
- API keys can be created through the API:
  - User makes an authenticated request
  - The system generates a new random key and returns it ONCE
  - Only the hash is stored in the database
- Keys can be viewed and revoked through dedicated endpoints
- Each API key use is logged for audit purposes
#### Design Considerations
- No master/global API key exists, which eliminates a single point of failure
- All keys are scoped to specific teams to enforce multi-tenant isolation
- Keys are transmitted only over HTTPS to prevent interception
This authentication approach balances security with usability for machine-to-machine API interactions, while maintaining complete isolation between different teams using the system.
### Database Structure
The Firestore database is organized into collections and documents with the following structure:
#### Collections and Documents
```
firestore-root/
├── teams/  # Teams collection
│   └── {team_id}/  # Team document
│       ├── name: string  # Team name
│       ├── created_at: timestamp
│       ├── updated_at: timestamp
│       │
│       ├── users/  # Team users subcollection
│       │   └── {user_id}/  # User document in team context
│       │       ├── role: string (admin, member, viewer)
│       │       ├── joined_at: timestamp
│       │       └── status: string (active, inactive)
│       │
│       ├── api_keys/  # Team API keys subcollection
│       │   └── {api_key_id}/  # API key document
│       │       ├── key_hash: string  # Hashed API key value
│       │       ├── name: string  # Key name/description
│       │       ├── created_at: timestamp
│       │       ├── expires_at: timestamp
│       │       ├── created_by: user_id
│       │       └── permissions: array  # Specific permissions
│       │
│       └── collections/  # Image collections subcollection
│           └── {collection_id}/  # Collection document
│               ├── name: string
│               ├── description: string
│               ├── created_at: timestamp
│               ├── created_by: user_id
│               └── metadata: map  # Collection-level metadata
├── users/  # Global users collection
│   └── {user_id}/  # User document
│       ├── email: string
│       ├── name: string
│       ├── created_at: timestamp
│       └── settings: map  # User preferences
└── images/  # Images collection
    └── {image_id}/  # Image document
        ├── filename: string
        ├── storage_path: string  # GCS path
        ├── mime_type: string
        ├── size_bytes: number
        ├── width: number
        ├── height: number
        ├── uploaded_at: timestamp
        ├── uploaded_by: user_id
        ├── team_id: string
        ├── collection_id: string  # Optional parent collection
        ├── status: string (processing, ready, error)
        ├── embedding_id: string  # Reference to vector DB
        ├── metadata: map  # Extracted and custom metadata
        │   ├── labels: array  # AI-generated labels
        │   ├── colors: array  # Dominant colors
        │   ├── objects: array  # Detected objects
        │   ├── custom: map  # User-defined metadata
        │   └── exif: map  # Original image EXIF data
        └── processing/  # Processing subcollection
            └── {job_id}/  # Processing job document
                ├── type: string  # embedding, analysis, etc.
                ├── status: string  # pending, running, complete, error
                ├── created_at: timestamp
                ├── updated_at: timestamp
                ├── completed_at: timestamp
                └── error: string  # Error message if applicable
```
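As an illustration of how an `images` document might be represented in application code, here is a sketch using a standard-library dataclass. The class name `ImageDocument` and the method `to_firestore` are hypothetical; the field names and status values come from the structure above, while the defaults (for example, new images starting as `processing`) are assumptions.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional

ALLOWED_STATUSES = {"processing", "ready", "error"}


@dataclass
class ImageDocument:
    """In-code mirror of one document in the `images` collection."""

    filename: str
    storage_path: str  # GCS path
    mime_type: str
    size_bytes: int
    width: int
    height: int
    uploaded_by: str
    team_id: str
    uploaded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    collection_id: Optional[str] = None  # optional parent collection
    status: str = "processing"  # assumed initial state
    embedding_id: Optional[str] = None  # set once the embedding exists
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    def to_firestore(self) -> Dict[str, Any]:
        """Convert to a plain dict suitable for writing as a document."""
        return asdict(self)


if __name__ == "__main__":
    doc = ImageDocument(
        filename="cat.jpg", storage_path="team-a/cat.jpg", mime_type="image/jpeg",
        size_bytes=2048, width=640, height=480, uploaded_by="user-1", team_id="team-a",
    )
    print(doc.to_firestore())
```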
#### Key Relationships and Indexes
- **Team-User**: Many-to-many relationship through the team's users subcollection
- **Team-Image**: One-to-many relationship (images belong to one team)
- **Collection-Image**: One-to-many relationship (each image belongs to at most one collection)
- **User-Image**: One-to-many relationship (upload attribution)
#### Composite Indexes
The following composite indexes are created to support efficient queries:
1. `images` collection:
   - `team_id` ASC, `uploaded_at` DESC → List recent images for a team
   - `team_id` ASC, `collection_id` ASC, `uploaded_at` DESC → List recent images in a collection
   - `team_id` ASC, `status` ASC, `uploaded_at` ASC → Find oldest processing images
   - `uploaded_by` ASC, `uploaded_at` DESC → List user's recent uploads
2. `users` subcollection (within teams):
   - `role` ASC, `joined_at` DESC → List team members by role
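To make the query shapes behind two of these indexes concrete, the sketch below expresses them over a plain Python list. It is only a model of the semantics (equality filters followed by an ordering on `uploaded_at`); the function names and sample data are hypothetical, and a real query would run against Firestore, where the composite index serves the same filter-and-order combination.

```python
from typing import Dict, List


def recent_images_for_team(images: List[Dict], team_id: str, limit: int = 20) -> List[Dict]:
    """Model of: team_id == X, ordered by uploaded_at descending."""
    matching = [img for img in images if img["team_id"] == team_id]
    matching.sort(key=lambda img: img["uploaded_at"], reverse=True)
    return matching[:limit]


def oldest_processing_images(images: List[Dict], team_id: str, limit: int = 20) -> List[Dict]:
    """Model of: team_id == X and status == 'processing', ordered by uploaded_at ascending."""
    matching = [
        img for img in images
        if img["team_id"] == team_id and img["status"] == "processing"
    ]
    matching.sort(key=lambda img: img["uploaded_at"])
    return matching[:limit]


# Hypothetical sample documents; integers stand in for timestamps.
IMAGES = [
    {"id": "a", "team_id": "t1", "status": "ready", "uploaded_at": 3},
    {"id": "b", "team_id": "t1", "status": "processing", "uploaded_at": 1},
    {"id": "c", "team_id": "t2", "status": "ready", "uploaded_at": 5},
    {"id": "d", "team_id": "t1", "status": "processing", "uploaded_at": 2},
]

if __name__ == "__main__":
    print([img["id"] for img in recent_images_for_team(IMAGES, "t1")])
    print([img["id"] for img in oldest_processing_images(IMAGES, "t1")])
```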
#### Security Rules
Firestore security rules enforce the following access patterns:
- Team admins can read/write all team data
- Team members can read all team data but can only write to collections and images
- Team viewers can only read team data
- Users can only access teams they belong to
- API keys have scoped access based on their assigned permissions
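The role-based part of these access patterns can be mirrored in application code as a small lookup table. This is a sketch in Python, not the Firestore rules themselves: `ROLE_RULES` and `is_allowed` are hypothetical names, the resource names are illustrative, and API key permission scoping is left out.

```python
from typing import Dict, Optional, Set

# "*" means every resource type within the team.
ROLE_RULES: Dict[str, Dict[str, Set[str]]] = {
    "admin": {"read": {"*"}, "write": {"*"}},
    "member": {"read": {"*"}, "write": {"collections", "images"}},
    "viewer": {"read": {"*"}, "write": set()},
}


def is_allowed(role: Optional[str], action: str, resource: str) -> bool:
    """Check a team-scoped action; `role` is None when the user is not in the team."""
    if role is None:
        return False  # users can only access teams they belong to
    rules = ROLE_RULES.get(role)
    if rules is None:
        return False  # unknown roles get no access
    allowed = rules.get(action, set())
    return "*" in allowed or resource in allowed


if __name__ == "__main__":
    print(is_allowed("member", "write", "images"))    # members may write images
    print(is_allowed("member", "write", "api_keys"))  # but not API keys
```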
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## API Modules Architecture
The SEREACT API is organized into the following key modules to ensure separation of concerns and maintainable code:
```
src/
├── api/  # API endpoints and routers
│   └── v1/  # API version 1 routes
├── auth/  # Authentication and authorization
├── config/  # Configuration management
├── models/  # Database models
├── services/  # Business logic services
└── utils/  # Utility functions
```
### Module Responsibilities
#### Router Module
- Defines API endpoints and routes
- Handles HTTP requests and responses
- Validates incoming request data
- Directs requests to appropriate services
- Implements API versioning
#### Auth Module
- Manages user authentication
- Handles API key validation and verification
- Implements role-based access control
- Provides security middleware
- Manages user sessions and tokens
#### Services Module
- Contains core business logic
- Orchestrates operations across multiple resources
- Implements domain-specific rules and workflows
- Integrates with external services (Cloud Vision, Storage)
- Handles image processing and embedding generation
#### Models Module
- Defines data structures and schemas
- Provides database entity representations
- Handles data validation and serialization
- Implements data relationships and constraints
- Manages database migrations
#### Utils Module
- Provides helper functions and utilities
- Implements common functionality used across modules
- Handles error processing and logging
- Provides formatting and conversion utilities
- Implements reusable middleware components
#### Config Module
- Manages application configuration
- Handles environment variable loading
- Provides centralized settings management
- Configures service connections and credentials
- Defines application constants and defaults
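A centralized settings object of this kind might look like the sketch below, which reads a few of the variables listed in the `.env` example using only the standard library. The `Settings` class and `load_settings` function are hypothetical, treating `FIRESTORE_PROJECT_ID` and `GCS_BUCKET_NAME` as required is an assumption, and the fallback values simply reuse the sample values from the `.env` example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Immutable view of the configuration used by the rest of the application."""

    firestore_project_id: str
    gcs_bucket_name: str
    pubsub_topic: str
    vision_api_enabled: bool
    vector_db_index_name: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to the process environment)."""
    source = os.environ if env is None else env
    required = ("FIRESTORE_PROJECT_ID", "GCS_BUCKET_NAME")  # assumed to be mandatory
    missing = [name for name in required if not source.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return Settings(
        firestore_project_id=source["FIRESTORE_PROJECT_ID"],
        gcs_bucket_name=source["GCS_BUCKET_NAME"],
        pubsub_topic=source.get("PUBSUB_TOPIC", "image-processing-topic"),
        vision_api_enabled=source.get("VISION_API_ENABLED", "true").lower() == "true",
        vector_db_index_name=source.get("VECTOR_DB_INDEX_NAME", "image-embeddings"),
    )


if __name__ == "__main__":
    demo_env = {"FIRESTORE_PROJECT_ID": "demo-project", "GCS_BUCKET_NAME": "demo-bucket"}
    print(load_settings(demo_env))
```

Passing the mapping explicitly keeps the loader easy to test without touching the real environment.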
### Module Interactions
```
┌─────────────┐         ┌─────────────┐         ┌─────────────┐
│             │         │             │         │             │
│   Router    │ ───────▶│  Services   │ ◀───────│   Config    │
│   Module    │         │   Module    │         │   Module    │
│             │         │             │         │             │
└──────┬──────┘         └──────┬──────┘         └─────────────┘
       │                       │
       │                       │
       ▼                       ▼
┌─────────────┐         ┌─────────────┐
│             │         │             │
│    Auth     │         │   Models    │
│   Module    │         │   Module    │
│             │         │             │
└──────┬──────┘         └──────┬──────┘
       │                       │
       │                       │
       └───────────────────────┘
            ┌─────────────┐
            │             │
            │    Utils    │
            │   Module    │
            │             │
            └─────────────┘
```
The modules interact in the following ways:
- **Request Flow**:
  - Client request arrives at the Router Module
  - Auth Module validates the request authentication
  - Router delegates to appropriate Service functions
  - Service uses Models to interact with the database
  - Service returns data to Router which formats the response
- **Cross-Cutting Concerns**:
  - Config Module provides settings to all other modules
  - Utils Module provides helper functions across the application
  - Auth Module secures access to routes and services
- **Dependency Direction**:
  - Router depends on Services and Auth
  - Services depend on Models and Config
  - Models depend on Utils for helper functions
  - Auth depends on Models for user information
  - All modules may use Utils and Config
This modular architecture provides several benefits:
- **Maintainability**: Changes in one module have minimal impact on others
- **Testability**: Modules can be tested in isolation with mocked dependencies
- **Scalability**: New features can be added by extending existing modules
- **Reusability**: Common functionality is centralized for consistent implementation
- **Security**: Authentication and authorization are handled consistently