Principal Engineer - GenAI, Big Data

Pakistan

Full Time

Experienced

Role - Principal Engineer
Location - Hybrid for Islamabad; remote for other cities

Role Summary
Lead the design, architecture, and development of next-generation intelligent systems at the intersection of Generative AI, Big Data, and Cloud. Define technical roadmaps, build production-grade multi-agent platforms, and drive innovation across distributed systems, lakehouse architectures, and LLMOps. Operate as a hands-on technical leader and mentor without formal management responsibilities.

Key Responsibilities

Design and deliver production-grade RAG systems with embedding refresh strategies, vector DB synchronization, and hybrid search.
Architect and implement AI agent orchestration frameworks (ReAct, multi-agent coordination, persistent state, error recovery, observability).
Build scalable event-driven architectures with idempotency, exactly-once/at-least-once semantics, poison message handling, and backpressure management.
Contribute to Lakehouse data architectures (Delta Lake, Iceberg, Hudi), addressing schema evolution, compaction, and ACID transactions on object storage.
Develop high-performance ML/LLM code for real-time pipelines, extending frameworks when required.
Collaborate with data scientists and platform engineers to accelerate model experimentation, validation, and deployment.
Define and implement LLMOps strategies including prompt versioning, token cost tracking, evaluation, and personalization.
Drive architectural vision through design/code reviews, mentorship, and thought leadership.
Innovate in Generative AI, distributed systems, and intelligent platforms from concept through delivery.

Must-Have Skills & Tools

3+ years of experience building and deploying ML/LLM solutions in production (RAG, LLM fine-tuning, embeddings).
Hands-on expertise with RAG system design: document chunking, vector DB synchronization, retrieval evaluation.
Deep knowledge of indexing algorithms (HNSW, IVF, LSH) and hybrid search.
Proven experience with agent orchestration frameworks (LangGraph, AutoGen, CrewAI, or custom).
Strong background in distributed systems and event-driven architectures (Kafka, Debezium, CDC, DLQs).
Cloud-native development expertise (AWS).
Strong programming skills (Python).

Nice-to-Have Skills

Experience with Graph ML and Graph RAG (ontologies, semantic layers, GNNs).
Familiarity with Big Data tools (Spark, Flink, PySpark, Glue, Druid).
Hands-on work with Lakehouse technologies (Delta, Iceberg, Hudi).
Experience designing evaluation frameworks for LLMs and multi-agent systems.
Experience building pipelines for unstructured data (PDFs, tables, images) and real-time personalization.

Soft Skills / Traits

Strong problem-solving in complex and ambiguous scenarios.
Excellent collaboration across data, AI, and engineering teams.
Ability to mentor peers and influence architectural decisions.
Clear technical communication skills for design reviews and cross-team discussions.

Apply for this position

Can you describe a production-grade RAG system you have designed or worked on?

Which vector DBs have you worked with (e.g., Pinecone, Weaviate, Milvus, FAISS)?

How do you evaluate the quality of retrieval in a RAG pipeline?

Have you worked with LangGraph, AutoGen, CrewAI, or custom agent orchestration frameworks? Did you implement features like error recovery, persistent state, or observability?

Walk us through a case where multiple agents collaborated to solve a problem.

Can you walk us through one LLM-based system you’ve deployed to production? What challenges did you face in scaling, monitoring, and updating it?

Have you used LangGraph, AutoGen, or CrewAI? Pick one and explain how you structured a multi-agent workflow. What challenges did you face in ensuring determinism, reliability, and error recovery?
