LLM-powered Semantic Search Platform

🧬 SRA Metadata Search Tool

Role: AI Programmer
Company: D24H
Project Period: Jan 2025 – Jun 2025

“Developed a production-grade, open-source natural-language search engine for the Sequence Read Archive (SRA), the world’s largest public genomic resource, enabling researchers to query and filter 35M+ records using semantic search, advanced filters, and state-of-the-art AI models.”

Project Highlights

  • LLM Integration: Built an end-to-end query interpretation pipeline using Gemini 2.0 Flash-Lite for natural-language parsing, typo correction, and metadata extraction (see the query-interpretation sketch below).
  • Semantic & ANN Search: Engineered fast, accurate semantic search with BERT-based SentenceTransformers and pgvector, delivering sub-0.2 s query latency across 35M+ samples (see the semantic-search sketch below).
  • Advanced Filtering: Supported complex queries on organism, geography, time, sequencing platform, and more, combining semantic and structured filters in a single query (also covered in the semantic-search sketch below).
  • Reranking & Relevance: Integrated transformer-based cross-encoder models (Jina AI) for context-aware reranking of search results (see the reranking sketch below).
  • User Interface: Designed a researcher-friendly UI for flexible searching and immediate data exploration, removing the need for SQL or bioinformatics expertise.
  • Scalable Infrastructure: Leveraged PostgreSQL 16, pgvector 0.8.0, and efficient embedding storage to achieve high throughput and low-cost querying at scale (see the storage-layer sketch below).
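
Illustrative Sketches

The snippets below are minimal sketches of the components described above; model choices, prompts, table and column names, connection strings, and helper functions are assumptions made for illustration rather than the project’s actual code.

Query interpretation: a sketch of the LLM parsing step, assuming the google-generativeai Python client; the prompt wording, the filter field names, and the parse_query helper are illustrative.

```python
import json

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; read from env in practice
model = genai.GenerativeModel("gemini-2.0-flash-lite")

PROMPT = """Extract search filters from a user query about SRA metadata.
Return JSON with keys: organism, country, platform, date_from, date_to, free_text.
Correct obvious typos; use null for fields that are not mentioned.
Query: {query}"""

def parse_query(query: str) -> dict:
    """Turn a natural-language query into structured filters via the LLM."""
    response = model.generate_content(
        PROMPT.format(query=query),
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)

# e.g. parse_query("human gut metagenomes from japan, ilumina, after 2021")
```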
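
Semantic search with filters: a sketch of embedding the query with SentenceTransformers and running a filtered nearest-neighbour query in pgvector via psycopg; the sra_samples table, its columns, the connection string, and the all-MiniLM-L6-v2 model are assumptions.

```python
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed 384-dim example model

def search(query_text: str, organism: str | None = None, limit: int = 20):
    """Embed the query, then run an ANN scan with an optional structured filter."""
    qvec = encoder.encode(query_text)
    sql = """
        SELECT accession, title, organism, embedding <=> %s AS distance
        FROM sra_samples
        WHERE (%s::text IS NULL OR organism = %s)
        ORDER BY embedding <=> %s
        LIMIT %s
    """
    with psycopg.connect("dbname=sra") as conn:  # assumed connection string
        register_vector(conn)  # lets psycopg bind numpy arrays as pgvector values
        return conn.execute(sql, (qvec, organism, organism, qvec, limit)).fetchall()

# e.g. search("gut microbiome shotgun metagenomes", organism="Homo sapiens")
```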
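
Reranking: a sketch of the cross-encoder step using sentence-transformers’ CrossEncoder; the specific Jina checkpoint named here, and scoring candidates on their title field, are assumptions.

```python
from sentence_transformers import CrossEncoder

# Assumed checkpoint; any cross-encoder reranker follows the same pattern.
reranker = CrossEncoder("jinaai/jina-reranker-v2-base-multilingual", trust_remote_code=True)

def rerank(query: str, candidates: list[dict], top_k: int = 10) -> list[dict]:
    """Score each (query, candidate) pair and keep the top_k most relevant."""
    pairs = [(query, c["title"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_k]]
```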
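
Storage layer: a sketch of the metadata table and HNSW index backing the ANN search; the table and column names, the 384-dimension embedding column, and the cosine-distance index parameters are illustrative assumptions.

```python
import psycopg

STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS sra_samples (
        accession   text PRIMARY KEY,
        title       text,
        organism    text,
        country     text,
        platform    text,
        released_at date,
        embedding   vector(384)  -- SentenceTransformer embedding
    )
    """,
    # Approximate nearest-neighbour index (HNSW, cosine distance).
    """
    CREATE INDEX IF NOT EXISTS sra_samples_embedding_idx
        ON sra_samples USING hnsw (embedding vector_cosine_ops)
    """,
]

with psycopg.connect("dbname=sra", autocommit=True) as conn:  # assumed DSN
    for stmt in STATEMENTS:
        conn.execute(stmt)
```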

Key Technologies: Python, Flask, PostgreSQL, pgvector, SentenceTransformers, Jina AI, Pandas, PyTorch, LLMs

References:

GitHub