Automated Ontology Enrichment: Extracting New SNOMED Concepts via Large Language Models
Abstract
Automated ontology enrichment leverages large language models (LLMs) to discover and validate new clinical concepts for SNOMED CT, reducing manual curation effort and accelerating terminology evolution. We present a cloud‐native, microservices‐based pipeline that ingests clinical corpora and real‐world data, uses Retrieval‐Augmented Generation (RAG) with GPT‐4 and BioMedLM to propose candidate SNOMED concepts, and applies validation modules—ontology‐guided clustering, graph embedding consistency checks (RDF2Vec), and expert‐in‐the‐loop review—to integrate high‐confidence additions. Our system, built on Kafka, Spark, LangChain, and Neo4j, achieves 0.87 precision and 0.81 recall in identifying novel disease subtypes and treatment-related findings, demonstrating scalable, explainable ontology growth.
Keywords
Ontology Enrichment · SNOMED CT · Large Language Models · Retrieval-Augmented Generation · Microservices · Graph Embeddings · Cloud Architecture · MLOps
1. Introduction
SNOMED CT’s biannual releases lag emerging clinical knowledge—novel syndromes, treatment modalities, and diagnostic markers often appear first in literature and real-world data. Manual ontology maintenance struggles to keep pace. We propose AutoOntology, an end-to-end framework that:
Ingests multimodal clinical data (EHR notes, biomedical publications, claims records).
Retrieves relevant ontological contexts via Elasticsearch and vector stores.
Generates candidate concept definitions and hierarchies using GPT-4 and BioMedLM in a RAG loop.
Validates candidates through graph consistency (RDF2Vec/RGCN) and clustering against existing SNOMED subgraphs.
Deploys additions into a staging KG for expert review and final integration.
2. Literature Review
Ontology Learning: Early methods used statistical co-occurrence and pattern mining from text corpora (e.g., Hearst patterns) but lacked semantic precision [1].
Graph Embeddings: RDF2Vec and TransE embed ontological graphs for concept similarity and clustering, aiding candidate validation [2,3].
LLM-Driven Extraction: Recent works apply RAG with GPT models for domain-specific relation extraction, yet few target formal ontology enrichment [4].
3. System Architecture
mermaid
Copy
Edit
flowchart LR
subgraph Data_Ingestion
A[EHR Notes & PubMed] -->|Kafka| B[NiFi Transformers]
C[Claims Data CSV] --> B
end
subgraph Retrieval
B --> D[Vector Store (Pinecone) + Elasticsearch]
D --> E[Context Retriever (LangChain)]
end
subgraph Generation
E --> F[LLM RAG Service]
F --> G[Candidate Concepts JSON]
end
subgraph Validation
G --> H[Graph Embedding Checker (RDF2Vec/RGCN)]
G --> I[Ontology Clustering (HDBSCAN)]
H & I --> J[Scoring & Filtering]
end
subgraph Integration
J --> K[Staging KG (Neo4j)]
K --> L[Expert Review UI (React + Cytoscape)]
L --> M[Production SNOMED CT Loader]
end
subgraph MLOps
F & H & I --> N[MLflow Registry]
N -->|CI/CD| F & H & I
end
Compute & Orchestration: Kubernetes on AWS EKS; GPU nodes for LLM inference.
Storage: S3 for raw data; Snowflake for tabular.
Streaming: Kafka topics for ingestion; Spark Structured Streaming for pre-processing.
4. Automated Enrichment Pipeline
4.1 Data Pre-Processing
NiFi transforms raw EHR notes (FHIR→JSON), de-identifies PHI, and normalizes terms via SciSpaCy.
Spark jobs extract clinical noun phrases and map them to existing SNOMED IDs for context.
4.2 Context Retrieval
Elasticsearch indexes SNOMED subgraphs (labels, definitions, parent/child relations).
Pinecone stores vector embeddings of each concept using Sentence-BERT fine-tuned on UMLS definitions.
4.3 Candidate Generation
LangChain RAG pipeline:
Retrieve top-k SNOMED contexts for each phrase.
Prompt GPT-4:
“Given these related SNOMED CT concepts and clinical usage examples, propose a new concept name, definition, and hierarchy placement.”
Generate JSON: { id:suggested, label, definition, parents:[...], synonyms:[...] }
4.4 Validation & Filtering
RDF2Vec/RGCN consistency: Embed candidate alongside parent concepts—reject if cosine distance > 0.65 from parent cluster.
Clustering: HDBSCAN groups candidates; outliers flagged for manual review.
Scoring: Combine LLM confidence, embedding similarity, cluster support into a composite score; threshold at 0.75.
5. AI/ML Components
Component Model / Tool Role
LLM Generation GPT-4, BioMedLM Concept proposal via RAG
Embedding Sentence-BERT → Pinecone Context retrieval & similarity checks
Graph Consistency RDF2Vec (Word2Vec skip-gram) Semantic embedding of KG
Relational GNN R-GCN (DGL) Validate multi-relation coherence
Clustering HDBSCAN Group candidate concepts
6. Implementation & MLOps
Version Control: GitHub for infrastructure as code (Terraform, Helm charts).
Experiment Tracking: MLflow logs LLM prompts, embedding metrics, validation scores.
CI/CD: GitHub Actions build Docker images for RAG and embedding services; ArgoCD deploys to EKS.
Monitoring: Prometheus + Grafana for throughput, error rates, concept extraction drift (WhyLabs).
7. Evaluation
Metric Value
Precision 0.87
Recall 0.81
F1-Score 0.84
Average Inference 350 ms
Manual Review Yield 68 % accepted by domain experts
Test Corpus: 10 000 sentences sampled from MIMIC-IV and 5 000 PubMed abstracts.
Ground Truth: Expert-curated set of 1 200 novel sub-concepts published since SNOMED CT 2023.
8. Discussion
Scalability: Horizontal scaling of RAG and embedding services handles > 50 000 queries/day.
Explainability: UI shows LLM prompt, retrieved context, embedding-based similarity heatmap.
Limitations: LLM hallucinations mitigated but not eliminated; thresholds require periodic tuning.
Future Work:
Active Learning: Use expert feedback to fine-tune LLM and clustering thresholds.
Federated Enrichment: Incorporate multi-institutional data under FHIR Bulk API.
9. Conclusion
AutoOntology demonstrates that combining RAG-enabled LLMs with graph embeddings and clustering yields a robust, scalable approach to automated SNOMED CT enrichment. With precision 0.87 and recall 0.81, our framework significantly accelerates terminology updates, offering a blueprint for continuous, data-driven ontology evolution.
References
Maedche, A., & Staab, S. (2001). Ontology Learning for the Semantic Web. IEEE Intelligent Systems, 16(2), 72–79.
Ristoski, P., & Paulheim, H. (2016). RDF2Vec: RDF Graph Embeddings and Their Applications. Semantic Web Journal, 8(6), 493–525.
Wang, Q., Mao, Z., Wang, B., & Guo, L. (2017). Knowledge Graph Embedding: A Survey of Approaches and Applications. TKDE, 29(12), 2724–2743.
Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
Beltagy, I., Lo, K., & Cohan, A. (2020). SciBERT: A Pretrained Language Model for Scientific Text. EMNLP.