I build data-intensive backend systems, ML platforms, MLOps workflows, and GenAI applications that move beyond notebooks into production-grade engineering.
3.5+ Years Experience · LeetCode 1875 · M.Tech Data Science · Ex-Capgemini
I work where software engineering meets data systems and applied ML. Three-and-a-half years at Capgemini's financial services practice shaped how I think: pipelines that don't drop records, models that ship behind evaluation gates, GenAI that runs on a budget.
The result is a profile that sits comfortably in software engineering, data engineering, ML systems, and GenAI — without pretending to be a researcher or a generic full-stack dev.
Multi-agent retrieval, hybrid search, prompt + cost optimization on GPT-4/3.5.
03 Impact
Selected metrics from production work and research projects.
01
~200 GB
Financial Data Processed
Compressed, distributed Spark ingestion on AWS EMR
02
~60%
Deployment Cycle Reduced
Airflow DAGs for retraining, evaluation gating, versioned CI/CD
03
30–40%
Manual Effort Cut
GPT-4/3.5 backend for financial document extraction
04
35–50%
Inference Cost Reduced
Hybrid retrieval, caching, parallel agents in RAG pipeline
05
30–40%
Factual Accuracy Gain
Agent validation loops + dense/sparse retrieval
06
90%
Classification Accuracy
Federated learning with differential privacy on medical data
07
1875
LeetCode Rating
Active problem-solving across DSA topics
08
2 × 2
Promotions / Top Quarters
Two promotions in two years; top performer two consecutive quarters
// Numbers reflect personal contributions; some bands provided as ranges to avoid overclaim.
04 Experience
Production engineering across financial services.
Financial Services
Associate Consultant — Data Science & Machine Learning
Capgemini · Mumbai, India · Hybrid
May 2021 — Sep 2024
Distributed ingestion service
Built a fault-tolerant ingestion service in Python and Apache Spark on AWS EMR processing ~200 GB of compressed financial data, with schema validation, dead-letter queues, and partitioned writes to S3.
Risk & fraud ML pipeline
Built modular components spanning feature engineering, training, threshold optimization, batch inference, and evaluation — improving AUC/F1 and cutting production false negatives.
MLOps orchestration
Orchestrated Airflow workflows on AWS (S3, EC2, EMR) automating DAGs for data validation, retraining, evaluation gating, and versioned CI/CD — reducing deployment cycle by ~60%.
GenAI document processing
Engineered a GPT-4/GPT-3.5 backend for extraction and summarization of financial documents with prompt engineering, caching, and batching — cutting manual effort by 30–40%.
Recognition
Top performer for 2 consecutive quarters; promoted twice within 2 years for delivering production-grade data and ML systems.
Python · Spark · Airflow · AWS EMR · S3 · GPT-4 · PostgreSQL · Docker
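A minimal sketch of the schema-validation and dead-letter idea behind the ingestion service described above, written in plain Python rather than Spark. The schema, field names, and function names are hypothetical and chosen only for illustration; this is not the production code.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical schema: field name -> required Python type.
SCHEMA: dict[str, type] = {"account_id": str, "amount": float, "currency": str}


@dataclass
class IngestResult:
    valid: list[dict[str, Any]] = field(default_factory=list)
    dead_letter: list[dict[str, Any]] = field(default_factory=list)


def validate(record: dict[str, Any]) -> list[str]:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    for name, expected in SCHEMA.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"bad type for {name}: {type(record[name]).__name__}")
    return errors


def ingest(records: list[dict[str, Any]]) -> IngestResult:
    """Route every record to exactly one of two sinks, so none is silently dropped."""
    result = IngestResult()
    for record in records:
        errors = validate(record)
        if errors:
            # Keep the original payload plus the reasons, so it can be replayed later.
            result.dead_letter.append({"record": record, "errors": errors})
        else:
            result.valid.append(record)
    return result


batch = [
    {"account_id": "A1", "amount": 10.5, "currency": "INR"},
    {"account_id": "A2", "amount": "oops", "currency": "INR"},
    {"amount": 3.0, "currency": "USD"},
]
result = ingest(batch)
```

The property worth checking is that the valid and dead-letter counts always add up to the input count, which is what "pipelines that don't drop records" means in practice.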
Early Career
Data Science Intern
Capgemini · Pune, India · On-site
Jul 2019 — Sep 2019
Automated ETL pipeline
Built an automated ETL pipeline for extraction and transformation, improving workflow efficiency by ~40% and reducing manual intervention.
Analytics dashboards
Developed interactive analytics dashboards over processed datasets, surfacing operational metrics that supported leadership decisions and contributed to a ~35% reduction in operational expenditure.
Python · ETL · Dashboards
05 Featured Projects
Real systems shipped — open source, research, and production case studies.
FiFantasy — End-to-end fantasy football platform for the FIFA World Cup 2026
End-to-end deployed: data engineering, ML rating model, realtime backend, system-design invariants, multi-project ops — solo, every layer.
Solo build, end-to-end deployed. Owned every layer — ingestion connectors, entity matching, layered rating model, simulator, realtime scoring engine, Next.js app, and ops. The interesting decisions live as much in the migrations folder and region pinning as in the model layers. Mirrors patterns I work with professionally (idempotent pipelines, schema-as-invariant, regional deploy tuning) on a stack owned from schema to UI to production.
Architecture
└─[Data Engineering] Idempotent ETL across 4 source connectors (REST · CSV · Kaggle · JSON) into Postgres via 22 numbered migrations, with an ingestion_runs audit table and rate-limit-aware backoff so reruns are safe by construction
└─[ML] 4-layer position-bucketed rating model: baseline + age curve → market-value z-score → Gemini LLM augmentation → international pedigree; blend weights driven by entity-match confidence so the model degrades gracefully on missing data
└─[System Design] Partial unique indexes (UNIQUE … WHERE status = 'pending') and pure idempotent scoring functions enforce auction, trade, and scoring invariants in the database — not in app code that can drift
└─[Realtime / Performance] Supabase logical replication → channels; Tokyo → Mumbai DB region migration measured ~100ms p95 improvement (~250ms → ~150ms cross-continent), executed via a two-phase migration script kept in the repo
└─[DevOps] Two Vercel projects (private league + public guest demo) deployed from one branch via env-driven SITE_MODE flag; pinned bom1 server region to co-locate with Mumbai DB (~5ms server↔DB vs ~120ms pre-pin)
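A minimal sketch of the partial-unique-index pattern from the System Design bullet above. It uses Python's built-in sqlite3 as a stand-in for Postgres (both support `CREATE UNIQUE INDEX … WHERE`); the `bids` table and its columns are hypothetical, not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bids (
        id INTEGER PRIMARY KEY,
        player_id INTEGER NOT NULL,
        bidder_id INTEGER NOT NULL,
        status TEXT NOT NULL
    )
    """
)
# Partial unique index: at most one *pending* bid per (player, bidder) pair.
conn.execute(
    """
    CREATE UNIQUE INDEX one_pending_bid
    ON bids (player_id, bidder_id)
    WHERE status = 'pending'
    """
)

conn.execute("INSERT INTO bids (player_id, bidder_id, status) VALUES (7, 1, 'pending')")

# A second pending bid for the same pair is rejected by the database itself.
try:
    conn.execute("INSERT INTO bids (player_id, bidder_id, status) VALUES (7, 1, 'pending')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Rows outside the index predicate are unconstrained, so history can accumulate.
conn.execute("INSERT INTO bids (player_id, bidder_id, status) VALUES (7, 1, 'settled')")
conn.execute("INSERT INTO bids (player_id, bidder_id, status) VALUES (7, 1, 'settled')")
row_count = conn.execute("SELECT COUNT(*) FROM bids").fetchone()[0]
```

Because the constraint lives in the index, any code path that writes to the table gets the same guarantee, including scripts and retries that bypass the application layer.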
Production RAG system that generates cited, hallucination-checked articles from your own documents
End-to-end deployed RAG: heading-aware ingestion, hybrid retrieval with ML reranker, multi-agent generation, four-layer hallucination defense, Dockerized full stack.
Solo build of a production-grade RAG pipeline that grounds every paragraph in user-uploaded documents. Owned the full stack — FastAPI backend, Next.js frontend, FAISS vector store, local sentence-transformers embeddings, a scikit-learn reranker, a four-agent orchestrator, and Docker Compose deployment with published GHCR images. Designed against the failure modes RAG systems actually hit in production (bad chunking, lost-in-the-middle context, citation drift, hallucination at chunk boundaries) and documented each design decision against alternatives.
Architecture
└─[Data Engineering] Heading-aware semantic chunking (512-word windows, 64-word overlap) with Unicode NFC normalization, ligature expansion, and SHA-256 source IDs for deduplication across PDF / Markdown / TXT inputs
└─[ML] Local sentence-transformers MiniLM-L6-v2 embeddings (384-dim, ~14.5k tok/s on CPU) indexed in FAISS IndexFlatIP wrapped in IndexIDMap2 — exact cosine search, zero external embedding cost
└─[ML] 6-feature scikit-learn gradient-boosted reranker (cosine, BM25 keyword overlap, query coverage, heading match, length, position) rescores FAISS top-8 → top-4 before generation, so retrieval failures don't poison the writer
└─[System Design] Four-agent pipeline (Planner → Retriever → Writer → Critic) with a shared PipelineContext object; Writer is prompt-constrained to source-only generation and inserts [CITE:chunk_id] markers that Critic verifies
└─[ML / Evaluation] 4-layer QA defense (citation-based grounding, embedding-based consistency, Flesch readability, retrieved-chunk coverage) combined into a single confidence score with a pass/fail threshold that gates output
└─[DevOps] Dockerized full stack with a single docker-compose up; production profile adds nginx reverse proxy + bearer-auth gating; GHCR-published images and a pytest suite covering ingestion, vector store, and orchestrator
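A minimal sketch of the citation check implied by the `[CITE:chunk_id]` markers in the System Design bullet above. The function name, the report shape, and the assumption that chunk IDs are alphanumeric are illustrative only; the actual Critic agent may verify more than this.

```python
import re

# Assumes chunk IDs contain only letters, digits, underscores, or hyphens.
CITE_PATTERN = re.compile(r"\[CITE:([A-Za-z0-9_-]+)\]")


def check_citations(draft: str, retrieved_ids: set[str]) -> dict[str, list[str]]:
    """Flag citations to unknown chunks and paragraphs with no citation at all."""
    unknown: list[str] = []
    uncited: list[str] = []
    paragraphs = [p.strip() for p in draft.split("\n\n") if p.strip()]
    for paragraph in paragraphs:
        cited = CITE_PATTERN.findall(paragraph)
        if not cited:
            uncited.append(paragraph)
        # A cited ID that was never retrieved cannot ground the paragraph.
        unknown.extend(cid for cid in cited if cid not in retrieved_ids)
    return {"unknown_ids": unknown, "uncited_paragraphs": uncited}


draft = (
    "FAISS performs exact search here [CITE:c1].\n\n"
    "The reranker keeps four chunks [CITE:c9].\n\n"
    "This sentence has no source."
)
report = check_citations(draft, retrieved_ids={"c1", "c2"})
```

In this example the report lists `c9` as an unknown ID and the last paragraph as uncited; either finding could be used to fail the draft or send it back to the writer.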
Federated Transfer Learning for Monkeypox Diagnosis
Privacy-preserving federated ML framework for medical image classification.
A federated learning system across distributed nodes with differential-privacy guarantees (ε-DP), enabling collaborative model training on sensitive medical data without raw data sharing.
Architecture
└─Distributed nodes train locally; only updates exchanged
└─Differential privacy noise added to gradients (ε-DP)
└─Transfer learning from pretrained vision backbones
└─Evaluated on constrained real-world medical data
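A minimal sketch of the "clip, average, add noise" step that the bullets above describe, in pure Python. It assumes Gaussian noise added to the averaged update on the aggregating side; the real system may instead add noise on each node, and the mapping from `noise_std` to a privacy budget ε is not shown. All names and values are hypothetical.

```python
import math
import random


def clip(update: list[float], max_norm: float) -> list[float]:
    """Scale an update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= max_norm or norm == 0.0:
        return list(update)
    scale = max_norm / norm
    return [x * scale for x in update]


def private_average(
    updates: list[list[float]],
    max_norm: float,
    noise_std: float,
    rng: random.Random,
) -> list[float]:
    """Average clipped client updates, then add Gaussian noise per coordinate."""
    clipped = [clip(u, max_norm) for u in updates]
    n = len(clipped)
    dim = len(clipped[0])
    mean = [sum(u[i] for u in clipped) / n for i in range(dim)]
    return [m + rng.gauss(0.0, noise_std) for m in mean]


# Two hypothetical client updates; only these vectors leave the nodes, never raw data.
client_updates = [[3.0, 4.0], [0.3, 0.4]]
clipped_first = clip(client_updates[0], max_norm=1.0)

# With noise_std=0.0 the result is the plain mean of the clipped updates.
noiseless = private_average(client_updates, max_norm=1.0, noise_std=0.0, rng=random.Random(0))
noisy = private_average(client_updates, max_norm=1.0, noise_std=0.1, rng=random.Random(0))
```

Clipping bounds how much any single node can move the average, which is what makes a fixed noise scale meaningful; a larger `noise_std` gives stronger privacy at the cost of accuracy.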
Production data + ML platform powering risk decisions in financial services.
End-to-end platform delivered at Capgemini: distributed ingestion, modular ML pipeline, Airflow-orchestrated retraining and evaluation gating. Closed-source — described here from professional experience.
Architecture
└─Spark on EMR for distributed ingestion of compressed financial data