RAG-powered chatbot helping immigrants navigate Finnish bureaucracy
Project Details
Production RAG system for extracting, processing, and querying Finnish immigration information. Complete pipeline from web crawling through semantic search to interactive chatbot interface. Built with LangChain, ChromaDB, and local LLM integration.
Tapio: Making Finnish Immigration Information Accessible Through RAG
Background and Motivation
After living in Finland for over a decade as an immigrant, I’ve experienced firsthand the challenge of navigating Finnish bureaucracy. Critical information about residence permits, work authorization, family reunification, and public services is scattered across government websites in dense legal language. For many immigrants, particularly those still developing Finnish language skills, finding accurate answers to urgent questions can be overwhelming.
This struggle is shared by my immigrant friends across different situations: students seeking education information, workers exploring employment options, families pursuing reunification, and refugees needing guidance through asylum processes. We all face the same fundamental problem: important information exists, but it’s difficult to find, understand, and apply to specific circumstances.
The Finnish Immigration Service (Migri) website contains comprehensive information, but its structure assumes familiarity with Finnish administrative systems and terminology. Simple questions like “Can my spouse work while we wait for their residence permit?” or “What documents do I need for family reunification?” often require reading multiple pages and cross-referencing complex regulations.
I built Tapio to address this information access gap. Rather than replacing official sources, Tapio makes existing information more accessible by allowing natural language queries that retrieve relevant content and provide conversational guidance. The name comes from Tapio, the forest god of Finnish mythology who guides travelers through unknown territory: an apt metaphor for helping immigrants find their way through bureaucratic complexity.
The Solution
Tapio is a production RAG (Retrieval-Augmented Generation) system that extracts and processes information from Finnish government websites such as Migri.fi and makes it searchable. The system provides conversational access to immigration information through a complete pipeline: web crawling, content parsing, vectorization, and an interactive chatbot interface.
Core capabilities:
- Web crawling: Configurable crawling of government sites with depth and domain restrictions
- Content parsing: Extraction and cleaning of relevant information from HTML
- Semantic search: ChromaDB vector database enabling retrieval of contextually relevant content
- Local LLM integration: Ollama for privacy-focused local inference without sending queries to external services
- Interactive chatbot: Gradio web interface for natural language questions
- Multi-database support: Abstraction layer supporting ChromaDB, Pinecone, pgvector, and Astra DB
Target users:
- EU and non-EU citizens navigating Finnish immigration processes
- Students, workers, families, refugees, and asylum seekers
- Anyone needing to understand Finnish residence permits, work authorization, or public services
How it helps: Rather than searching through multiple government pages, users can ask questions in natural language: “What happens if my residence permit expires while I’m waiting for renewal?” The system retrieves relevant passages and provides conversational guidance based on official information.
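As an illustration of that interaction layer (not Tapio’s actual interface code), Gradio’s `ChatInterface` wraps a single answer function; the canned reply below stands in for the RAG chain described in the next section.

```python
import gradio as gr

def answer(message: str, history: list) -> str:
    # In Tapio this would call the RAG chain; a canned reply keeps the sketch runnable.
    return f"(retrieved guidance for: {message})"

gr.ChatInterface(answer, title="Tapio",
                 description="Ask about Finnish immigration in plain language.").launch()
```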
Technical Implementation
Architecture: Production RAG Pipeline
Tapio implements a complete RAG workflow using the LangChain framework; a condensed sketch of how the stages wire together follows the list below.
Pipeline stages:
- Web Crawling: Configurable crawler for Finnish government sites
- Content Parsing: HTML cleaning and text extraction
- Text Splitting: LangChain text splitters for optimal chunk sizes
- Vectorization: HuggingFace embeddings for semantic representation
- Storage: Vector database with semantic search capabilities
- Retrieval: Context-aware content retrieval based on queries
- Generation: Local LLM inference with retrieved context
- Interface: Interactive chatbot with Gradio
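These stages map naturally onto LangChain primitives. Below is a condensed, illustrative sketch of the wiring rather than Tapio’s actual code: `WebBaseLoader` stands in for the custom crawler, and the URL, model names, and `./tapio_db` path are assumptions.

```python
# Condensed RAG pipeline sketch: load -> split -> embed -> store -> retrieve -> generate.
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# Stages 1-2: fetch and parse one page (Tapio uses its own crawler; this is a stand-in).
docs = WebBaseLoader("https://migri.fi/en/residence-permits").load()  # illustrative URL

# Stage 3: split into overlapping chunks so legal context survives chunk boundaries.
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)

# Stages 4-5: embed locally and persist in ChromaDB.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma.from_documents(chunks, embeddings, persist_directory="./tapio_db")

# Stages 6-7: retrieve relevant chunks and answer with a local model via Ollama.
qa = RetrievalQA.from_chain_type(
    llm=Ollama(model="llama3"),
    retriever=store.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,  # keep source passages for citation
)
answer = qa.invoke({"query": "Can my spouse work while waiting for a residence permit?"})
print(answer["result"])
```

Stage 8, the Gradio interface, is sketched earlier on this page.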
Key Technical Decisions
LangChain Framework
Chose LangChain for production-grade RAG patterns rather than building from scratch. The framework provides:
- Proven text splitting strategies for optimal chunk sizes (sketched after this list)
- Built-in embedding integration (HuggingFace)
- Chain orchestration for retrieval and generation
- Extensible architecture for different LLM backends
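For instance, the recursive splitter tries coarse separators first, so chunks end at paragraph or sentence boundaries rather than mid-clause. A minimal sketch; the sizes and separators here are illustrative, not Tapio’s tuned settings.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = (
    "A residence permit is required for stays longer than 90 days.\n\n"
    "Family members may apply on the basis of family ties. "
    "Processing times vary by permit type."
)

# Coarse separators are tried first, so splits land on paragraph or sentence
# boundaries; overlap carries surrounding context into the next chunk.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=120,
    chunk_overlap=30,
    separators=["\n\n", "\n", ". ", " "],
)
for chunk in splitter.split_text(text):
    print(repr(chunk))
```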
Local LLM with Ollama
Privacy was a critical requirement. Immigration questions can be sensitive (asylum cases, family situations, legal status). Using Ollama for local inference, sketched after this list, means:
- No query data sent to external services
- Complete privacy for users
- No API costs or rate limits
- Works offline once models are downloaded
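A minimal sketch of local inference through LangChain’s community wrapper, assuming the Ollama daemon is running and a model has been pulled (`llama3` here is an assumption, not necessarily the model Tapio uses):

```python
from langchain_community.llms import Ollama

# Talks only to the local Ollama daemon (default http://localhost:11434);
# requires a pulled model, e.g. `ollama pull llama3`. No data leaves the machine.
llm = Ollama(model="llama3", temperature=0.1)  # low temperature for a factual tone
print(llm.invoke("In one sentence, what is a residence permit?"))
```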
Multi-Database Abstraction
Built an abstraction layer supporting multiple vector databases:
- ChromaDB: Default option, easy local setup
- Pinecone: Cloud-hosted option for scaling
- pgvector: PostgreSQL extension for existing infrastructure
- Astra DB: Managed DataStax option
This flexibility allows deployment in different environments (local development, cloud hosting, enterprise infrastructure) without rewriting core logic.
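A minimal sketch of what such an abstraction can look like: a shared Protocol the pipeline codes against, plus one adapter per database. The names are illustrative, not Tapio’s actual interfaces; only the ChromaDB adapter is shown.

```python
from typing import Protocol, Sequence

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

class VectorStoreBackend(Protocol):
    """The surface the rest of the pipeline depends on."""
    def add_documents(self, docs: Sequence[Document]) -> None: ...
    def search(self, query: str, k: int = 4) -> list[Document]: ...

class ChromaBackend:
    """Adapter for the default local option; Pinecone, pgvector, and Astra DB
    adapters would expose the same two methods."""
    def __init__(self, persist_directory: str = "./tapio_db") -> None:
        self._store = Chroma(
            embedding_function=HuggingFaceEmbeddings(
                model_name="sentence-transformers/all-MiniLM-L6-v2"),
            persist_directory=persist_directory,
        )

    def add_documents(self, docs: Sequence[Document]) -> None:
        self._store.add_documents(list(docs))

    def search(self, query: str, k: int = 4) -> list[Document]:
        return self._store.similarity_search(query, k=k)
```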
HuggingFace Embeddings
Used HuggingFace models for text embeddings rather than OpenAI, for consistency with the privacy-focused approach. Embeddings run locally, maintaining the zero-external-dependency principle.
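In code this is a one-liner; the sentence-transformers model named below is a common default and an assumption, not necessarily Tapio’s choice. The weights download once from the HuggingFace Hub, after which embedding is fully local.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector = emb.embed_query("What documents do I need for family reunification?")
print(len(vector))  # 384 dimensions for this particular model
```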
Engineering Practices
Production-grade approach:
- Comprehensive testing: pytest suite covering pipeline stages (an illustrative check follows this list)
- Type checking: mypy for static type validation
- Code quality: ruff and pre-commit hooks for consistency
- Modern Python tooling: uv for dependency management
- Modular architecture: Clean separation of crawling, parsing, vectorization, retrieval, and chat components
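As one illustration of the testing style (the invariant and parameters are examples chosen for this sketch, not Tapio’s actual tests), a property-style check on the splitting stage:

```python
import pytest
from langchain.text_splitter import RecursiveCharacterTextSplitter

@pytest.mark.parametrize("chunk_size,overlap", [(500, 100), (1000, 200)])
def test_chunks_respect_size_limit(chunk_size: int, overlap: int) -> None:
    text = "A residence permit is required for stays over 90 days. " * 50
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=overlap).split_text(text)
    assert chunks, "splitter should produce at least one chunk"
    assert all(len(c) <= chunk_size for c in chunks)
```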
Challenges and Solutions
Balancing Comprehensiveness with Relevance
Government websites contain thousands of pages. Crawling everything creates noise; being too selective risks missing important information.
Solution: Configurable crawling with domain restrictions and depth limits. Start with core immigration information (Migri.fi residence permits section), then expand based on common question patterns. The modular architecture allows easy addition of new sources.
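A sketch of what that scoping can look like as configuration; the dataclass and its field names are hypothetical, not Tapio’s actual schema.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

@dataclass
class CrawlConfig:
    """Hypothetical crawl scoping: start narrow, widen deliberately."""
    start_urls: list[str] = field(
        default_factory=lambda: ["https://migri.fi/en/residence-permits"])  # illustrative URL
    allowed_domains: set[str] = field(default_factory=lambda: {"migri.fi"})
    max_depth: int = 2    # link hops from a start URL before the crawler stops
    max_pages: int = 500  # hard cap to bound noise and crawl time

    def in_scope(self, url: str, depth: int) -> bool:
        host = urlparse(url).netloc.removeprefix("www.")
        return host in self.allowed_domains and depth <= self.max_depth
```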
Handling Complex Legal Language
Finnish immigration regulations use precise legal terminology that can be difficult to parse and represent in vector embeddings.
Solution: LangChain’s text splitting preserves context around legal terms. The RAG approach provides actual source passages rather than paraphrasing, allowing users to see official wording. Chunking strategy balances preserving context with retrieval precision.
Privacy for Sensitive Queries
Immigration questions often involve sensitive personal situations. Sending queries to external LLM APIs creates privacy concerns.
Solution: Complete local inference pipeline with Ollama. No query data leaves the user’s machine. The trade-off is the need for local compute resources, but the privacy benefit outweighs that cost for this use case.
Maintaining Accuracy
RAG systems can hallucinate or misinterpret information. Immigration guidance requires accuracy.
Solution: System returns source passages with citations, allowing users to verify information against official sources. Prompts emphasize providing retrieved information rather than generating new claims. Users see both the chatbot response and the source chunks it’s based on.
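A sketch of surfacing that evidence next to the answer. The helper below is hypothetical; it assumes each retrieved `Document` carries its origin URL in `metadata["source"]`, which LangChain loaders commonly set.

```python
from langchain_core.documents import Document

def format_answer_with_sources(answer: str, sources: list[Document]) -> str:
    """Append each retrieved chunk's origin so users can check the official wording."""
    lines = [answer, "", "Sources:"]
    for i, doc in enumerate(sources, 1):
        url = doc.metadata.get("source", "unknown")
        lines.append(f"[{i}] {url} :: {doc.page_content[:120]}...")
    return "\n".join(lines)

# With a RetrievalQA chain built with return_source_documents=True (see the
# pipeline sketch above), usage would look like:
#   out = qa.invoke({"query": "What happens if my permit expires during renewal?"})
#   print(format_answer_with_sources(out["result"], out["source_documents"]))
```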
Development Status
What Was Built
Complete RAG pipeline:
- Web crawler for Finnish government sites
- HTML parser and text extraction
- Vectorization with HuggingFace embeddings
- Vector database integration (multiple options)
- Local LLM inference with Ollama
- Interactive Gradio chatbot interface
- Comprehensive test suite
- Type-checked codebase with quality gates
Demonstrated capabilities:
- Extract and process immigration information from Migri.fi
- Build searchable knowledge base with semantic retrieval
- Provide conversational access to official information
- Maintain privacy with local inference
- Deploy in different environments with database flexibility
Current Status
The technical implementation is complete and functional. The system successfully demonstrates end-to-end RAG capabilities from web crawling through conversational queries. The codebase represents production-grade engineering practices including comprehensive testing, type checking, and modular architecture.
Why development paused: I built Tapio while starting a demanding new role, leading to overextension and burnout. Additionally, being contacted by scammers made me reconsider the project’s direction and the responsibility of providing immigration guidance.
Technical value: Despite pausing active development, Tapio demonstrates complete RAG system implementation with production engineering practices. The project proves capability beyond tutorials: building modular architectures, integrating multiple technologies (LangChain, vector databases, local LLMs), and maintaining code quality through testing and type checking.
Repository status: Code is available on GitHub as an example of production RAG implementation. The technical work is solid and could be valuable for others building similar information access systems or learning RAG architecture patterns.
Personal Reflections
Building Tapio taught me several important lessons about AI application development and personal sustainability.
Technical insights: The RAG pattern is powerful for information access problems. Rather than training custom models or fine-tuning, combining semantic search with LLM generation provides accurate, verifiable responses grounded in source material. The modular architecture (separating crawling, vectorization, retrieval, and generation) makes the system maintainable and extensible. Choosing local inference with Ollama proved the right decision for privacy-sensitive applications, even at the cost of convenience.
Personal insights: The experience reinforced the importance of sustainable commitments. Building a startup-level project while starting a demanding new job led to burnout. I learned to be more thoughtful about work-life balance and to recognize when I’m overextending myself. Setting boundaries around side projects is necessary for long-term productivity and wellbeing.
Project value: While I paused development, the technical work demonstrates real capability. The complete pipeline from web crawling through conversational interface shows end-to-end system thinking. The production engineering practices (testing, type checking, modular design) prove I can build maintainable AI applications, not just experiments.
The project remains valuable as a technical demonstration and potentially as a foundation for others tackling similar information access challenges. The core insight—that RAG can make bureaucratic information more accessible to those who need it most—remains valid even if I’m not actively developing this specific implementation.