01 Index · Suyash Patil

Suyash Pradeep Patil


Software Engineer — Data & AI Systems

I build data-intensive backend systems, ML platforms, MLOps workflows, and GenAI applications that move beyond notebooks into production-grade engineering.

3.5+ Years Experience · LeetCode 1875 · M.Tech Data Science · Ex-Capgemini
SYSTEM_INFO · ONLINE
LOCATION: Bhopal, India
ROLE: SDE / ML / Data & AI
EXPERIENCE: 3.5+ years
EDUCATION: M.Tech, IIIT Bhopal
STATUS: Open to roles
DSA: LeetCode 1875
200 GB Processed
60% Deploy ↓
1875 LeetCode
02 Positioning

I work where software engineering meets data systems and applied ML.

Three-and-a-half years at Capgemini's financial services practice shaped how I think: pipelines that don't drop records, models that ship behind evaluation gates, GenAI that runs on a budget.

The result is a profile that sits comfortably in software engineering, data engineering, ML systems, and GenAI — without pretending to be a researcher or a generic full-stack dev.

01

Software Engineering

REST APIs, microservices, async processing, event-driven workflows.

02

Data Engineering

Distributed Spark ingestion, Airflow DAGs, schema validation, batch & stream.

03

ML Systems

Feature pipelines, training, threshold tuning, batch inference, evaluation gates.

04

GenAI / RAG

Multi-agent retrieval, hybrid search, prompt + cost optimization on GPT-4/3.5.

03 Impact

Selected metrics from production work and research projects.

01
~200 GB
Financial Data Processed

Compressed, distributed Spark ingestion on AWS EMR

02
~60%
Deployment Cycle Reduced

Airflow DAGs for retraining, evaluation gating, versioned CI/CD

03
30–40%
Manual Effort Cut

GPT-4/3.5 backend for financial document extraction

04
35–50%
Inference Cost Reduced

Hybrid retrieval, caching, parallel agents in RAG pipeline

05
30–40%
Factual Accuracy Gain

Agent validation loops + dense/sparse retrieval

06
90%
Classification Accuracy

Federated learning with differential privacy on medical data

07
1875
LeetCode Rating

Active problem-solving across DSA topics

08
2 × 2
Promotions / Top Quarters

Two promotions in two years; top performer two consecutive quarters

// Numbers reflect personal contributions; some figures are given as ranges to avoid overclaiming.

04 Experience

Production engineering across financial services.

Financial Services

Associate Consultant — Data Science & Machine Learning

Capgemini · Mumbai, India · Hybrid

May 2021 — Sep 2024
Distributed ingestion service

Built a fault-tolerant ingestion service in Python and Apache Spark on AWS EMR processing ~200 GB of compressed financial data, with schema validation, dead-letter queues, and partitioned writes to S3.
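
The validate-or-dead-letter pattern named above can be sketched in plain Python. This is an illustration under stated assumptions, not the production Spark job: the schema, the field names (`txn_id`, `amount`, `booked_on`), and the date-based partition key are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Hypothetical schema (assumption): field name -> required Python type.
SCHEMA = {"txn_id": str, "amount": float, "booked_on": str}


@dataclass
class IngestResult:
    valid: dict = field(default_factory=dict)        # partition key -> records
    dead_letter: list = field(default_factory=list)  # (record, reason) pairs


def validate(record: dict) -> Optional[str]:
    """Return None when the record matches SCHEMA, else a reason string."""
    for name, expected in SCHEMA.items():
        if name not in record:
            return f"missing field: {name}"
        if not isinstance(record[name], expected):
            return f"bad type for {name}"
    try:
        date.fromisoformat(record["booked_on"])
    except ValueError:
        return "booked_on is not an ISO date"
    return None


def ingest(records: list) -> IngestResult:
    """Route each record to a partition or to the dead-letter list."""
    result = IngestResult()
    for record in records:
        reason = validate(record)
        if reason is not None:
            # Bad records are kept with a reason instead of being dropped.
            result.dead_letter.append((record, reason))
            continue
        partition = f"booked_on={record['booked_on']}"
        result.valid.setdefault(partition, []).append(record)
    return result
```

Keeping the rejection reason next to each dead-lettered record is what makes a later replay or audit possible; a distributed job would write both outputs to storage rather than hold them in memory.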

Risk & fraud ML pipeline

Built modular components spanning feature engineering, training, threshold optimization, batch inference, and evaluation — improving AUC/F1 and cutting production false negatives.
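
Threshold optimization of the kind mentioned above can be illustrated with a small sweep. This is a minimal sketch, assuming binary labels and an F-beta objective; the function names are hypothetical and the actual objective used in production is not stated in the source.

```python
def confusion(scores, labels, threshold):
    """Count true positives, false positives, and false negatives."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp, fp, fn


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score; beta > 1 weights recall (fewer false negatives) more."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(scores, labels, beta=1.0):
    """Try every observed score as a cut-off and keep the best F-beta.

    Ties go to the lower threshold, which flags more cases as positive.
    """
    candidates = sorted(set(scores))
    return max(
        candidates,
        key=lambda t: (f_beta(*confusion(scores, labels, t), beta), -t),
    )
```

For a fraud use case, raising `beta` above 1 shifts the chosen threshold toward fewer false negatives at the cost of more false positives.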

MLOps orchestration

Orchestrated Airflow workflows on AWS (S3, EC2, EMR) automating DAGs for data validation, retraining, evaluation gating, and versioned CI/CD — reducing deployment cycle by ~60%.
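
An evaluation gate reduces to a pure function that compares a candidate model's metrics against the current baseline. The sketch below is an assumption-laden illustration: the metric names, the AUC floor, and the regression tolerance are hypothetical, and the Airflow wiring is left out.

```python
def passes_gate(candidate, baseline, min_auc=0.80, max_regression=0.01):
    """Decide whether a retrained model may be promoted.

    candidate / baseline: dicts with "auc" and "f1" (hypothetical metric names).
    Returns (ok, reasons) so a failed gate explains itself in the logs.
    """
    reasons = []
    if candidate["auc"] < min_auc:
        reasons.append(f"auc {candidate['auc']:.3f} below floor {min_auc:.3f}")
    for metric in ("auc", "f1"):
        if candidate[metric] < baseline[metric] - max_regression:
            reasons.append(f"{metric} regressed against baseline")
    return (not reasons, reasons)
```

In an orchestrator, a function like this would typically sit in a task that short-circuits the downstream deploy step when it returns `False`.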

GenAI document processing

Engineered a GPT-4/GPT-3.5 backend for extraction and summarization of financial documents with prompt engineering, caching, and batching — cutting manual effort by 30–40%.
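
Caching and batching around a model call can be sketched without any vendor SDK. In the example below, `call_model` is a hypothetical callable that takes a list of prompts and returns a list of outputs in the same order; the class name and the cache key format are assumptions for illustration only.

```python
import hashlib


class CachedExtractor:
    """Deduplicate, cache, and batch prompts before calling a model."""

    def __init__(self, call_model, model="example-model"):
        self._call_model = call_model  # hypothetical: list of str -> list of str
        self._model = model
        self._cache = {}

    def _key(self, prompt):
        # The model name is part of the key so outputs never leak across models.
        raw = f"{self._model}\n{prompt}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def run(self, prompts, batch_size=8):
        # Collect prompts that are neither cached nor already queued.
        pending, seen = [], set()
        for prompt in prompts:
            key = self._key(prompt)
            if key not in self._cache and key not in seen:
                seen.add(key)
                pending.append(prompt)
        # Send the remaining prompts in fixed-size batches.
        for start in range(0, len(pending), batch_size):
            batch = pending[start:start + batch_size]
            for prompt, output in zip(batch, self._call_model(batch)):
                self._cache[self._key(prompt)] = output
        return [self._cache[self._key(p)] for p in prompts]
```

Repeated documents or boilerplate sections then cost one model call instead of many, which is where most of the savings in a pipeline like this tend to come from.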

Recognition

Top performer for 2 consecutive quarters; promoted twice within 2 years for delivering production-grade data and ML systems.

Python · Spark · Airflow · AWS EMR · S3 · GPT-4 · PostgreSQL · Docker
Early Career

Data Science Intern

Capgemini · Pune, India · On-site

Jul 2019 — Sep 2019
Automated ETL pipeline

Built an automated ETL pipeline for extraction and transformation, improving workflow efficiency by ~40% and reducing manual intervention.

Analytics dashboards

Developed interactive analytics dashboards over processed datasets, surfacing operational metrics that supported leadership decisions and contributed to a ~35% reduction in operational expenditure.

Python · ETL · Dashboards
05 Featured Projects

Real systems shipped — open source, research, and production case studies.

Open Project · 01 / 04

FiFantasy — End-to-end fantasy football platform for the FIFA World Cup 2026

End-to-end deployed: data engineering, ML rating model, realtime backend, system-design invariants, multi-project ops — solo, every layer.

Solo build, end-to-end deployed. Owned every layer — ingestion connectors, entity matching, layered rating model, simulator, realtime scoring engine, Next.js app, and ops. The interesting decisions live as much in the migrations folder and region pinning as in the model layers. Mirrors patterns I work with professionally (idempotent pipelines, schema-as-invariant, regional deploy tuning) on a stack owned from schema to UI to production.

Architecture
  • └─[Data Engineering] Idempotent ETL across 4 source connectors (REST · CSV · Kaggle · JSON) into Postgres via 22 numbered migrations, with an ingestion_runs audit table and rate-limit-aware backoff so reruns are safe by construction
  • └─[ML] 4-layer position-bucketed rating model: baseline + age curve → market-value z-score → Gemini LLM augmentation → international pedigree; blend weights driven by entity-match confidence so the model degrades gracefully on missing data
  • └─[System Design] Partial unique indexes (UNIQUE … WHERE status = 'pending') and pure idempotent scoring functions enforce auction, trade, and scoring invariants in the database — not in app code that can drift
  • └─[Realtime / Performance] Supabase logical replication → channels; Tokyo → Mumbai DB region migration measured ~100ms p95 improvement (~250ms → ~150ms cross-continent), executed via a two-phase migration script kept in the repo
  • └─[DevOps] Two Vercel projects (private league + public guest demo) deployed from one branch via env-driven SITE_MODE flag; pinned bom1 server region to co-locate with Mumbai DB (~5ms server↔DB vs ~120ms pre-pin)
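
The partial unique index from the System Design bullet can be demonstrated with Python's built-in `sqlite3`, which also supports `CREATE UNIQUE INDEX … WHERE`. This is a stand-in sketch: the project uses Postgres, and the `bids` table, its columns, and the helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bids ("
    " id INTEGER PRIMARY KEY,"
    " player_id INTEGER NOT NULL,"
    " status TEXT NOT NULL)"
)
# At most one pending bid per player; settled rows are not constrained.
conn.execute(
    "CREATE UNIQUE INDEX one_pending_bid_per_player"
    " ON bids (player_id) WHERE status = 'pending'"
)


def place_bid(player_id):
    """Insert a pending bid; the database rejects a second pending one."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO bids (player_id, status) VALUES (?, 'pending')",
                (player_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def settle(player_id):
    """Move the pending bid out of the constrained state."""
    with conn:
        conn.execute(
            "UPDATE bids SET status = 'settled'"
            " WHERE player_id = ? AND status = 'pending'",
            (player_id,),
        )
```

Because the rule lives in the index, two concurrent writers cannot both succeed, regardless of which application path issued the insert.
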
~100ms
p95 latency cut via region migration
47k → 94%
Players matched, high/med confidence
Distributed Systems · ETL Pipelines · Statistical Modeling · LLM Integration · Cloud Deployment · PostgreSQL · TypeScript
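
One way to read "blend weights driven by entity-match confidence" from the ML bullet above is a weighted average in which every non-baseline layer is scaled by the match confidence. The sketch below is a guess at the shape of such a blend, not the project's actual formula; the function name, layer names, and weights are hypothetical.

```python
def blend_rating(baseline, layers, match_confidence):
    """Blend a baseline rating with optional extra layers.

    layers: dict of name -> (value or None, base_weight).
    Each layer's weight is scaled by match_confidence, so a poorly matched
    or missing layer contributes little or nothing and the result falls
    back toward the baseline.
    """
    confidence = min(max(match_confidence, 0.0), 1.0)
    total_weight = 1.0          # the baseline always carries weight 1
    weighted_sum = baseline
    for value, base_weight in layers.values():
        if value is None:       # layer unavailable for this player
            continue
        weight = base_weight * confidence
        weighted_sum += value * weight
        total_weight += weight
    return weighted_sum / total_weight
```

With zero confidence or no available layers, the function returns the baseline unchanged, which is the graceful-degradation property the bullet describes.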
Open Project · 02 / 04

Production RAG system that generates cited, hallucination-checked articles from your own documents

End-to-end deployed RAG: heading-aware ingestion, hybrid retrieval with ML reranker, multi-agent generation, four-layer hallucination defense, Dockerized full stack.

Solo build of a production-grade RAG pipeline that grounds every paragraph in user-uploaded documents. Owned the full stack — FastAPI backend, Next.js frontend, FAISS vector store, local sentence-transformers embeddings, a scikit-learn reranker, a four-agent orchestrator, and Docker Compose deployment with published GHCR images. Designed against the failure modes RAG systems actually hit in production (bad chunking, lost-in-the-middle context, citation drift, hallucination at chunk boundaries) and documented each design decision against alternatives.

Architecture
  • └─[Data Engineering] Heading-aware semantic chunking (512-word windows, 64-word overlap) with Unicode NFC normalization, ligature expansion, and SHA-256 source IDs for deduplication across PDF / Markdown / TXT inputs
  • └─[ML] Local sentence-transformers MiniLM-L6-v2 embeddings (384-dim, ~14.5k tok/s on CPU) indexed in FAISS IndexFlatIP wrapped in IndexIDMap2: exact cosine search on L2-normalized vectors, zero external embedding cost
  • └─[ML] 6-feature scikit-learn gradient-boosted reranker (cosine, BM25 keyword overlap, query coverage, heading match, length, position) rescores FAISS top-8 → top-4 before generation, so retrieval failures don't poison the writer
  • └─[System Design] Four-agent pipeline (Planner → Retriever → Writer → Critic) with a shared PipelineContext object; Writer is prompt-constrained to source-only generation and inserts [CITE:chunk_id] markers that Critic verifies
  • └─[ML / Evaluation] 4-layer QA defense (citation-based grounding, embedding-based consistency, Flesch readability, retrieved-chunk coverage) combined into a confidence score with a pass/fail threshold that gates output
  • └─[DevOps] Dockerized full stack with a single docker-compose up; production profile adds nginx reverse proxy + bearer-auth gating; GHCR-published images and a pytest suite covering ingestion, vector store, and orchestrator
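
The windowed chunking from the Data Engineering bullet can be sketched as follows. This is a simplified illustration: heading awareness and ligature expansion are omitted, and hashing the chunk text to get an ID is an assumption (the source says only that SHA-256 IDs are used for deduplication).

```python
import hashlib
import unicodedata


def chunk_words(text, window=512, overlap=64):
    """Split text into overlapping word windows with content-derived IDs.

    Returns a list of (chunk_id, chunk_text) pairs.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    # NFC normalization so visually identical text hashes identically.
    words = unicodedata.normalize("NFC", text).split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + window])
        chunk_id = hashlib.sha256(piece.encode("utf-8")).hexdigest()
        chunks.append((chunk_id, piece))
        if start + window >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Identical chunks from re-uploaded files produce identical IDs, so they can be skipped before embedding.
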
top-8 → 4
FAISS + ML rerank pipeline
4-layer
Hallucination defense
RAG Systems · Multi-Agent Orchestration · Vector Search · ML Reranking · FastAPI · Docker · Python
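
The citation check performed by the Critic can be illustrated with a small verifier for `[CITE:chunk_id]` markers. This is a sketch under assumptions: the marker format comes from the description above, while the ID character set, the paragraph splitting rule, and the function name are hypothetical.

```python
import re

# Assumed ID alphabet: letters, digits, underscore, hyphen.
CITE = re.compile(r"\[CITE:([A-Za-z0-9_-]+)\]")


def check_citations(draft, retrieved_ids):
    """Verify that every paragraph cites a chunk that was actually retrieved."""
    paragraphs = [p for p in draft.split("\n\n") if p.strip()]
    # Markers pointing at chunks the retriever never returned.
    unknown = sorted(
        {cid for cid in CITE.findall(draft) if cid not in retrieved_ids}
    )
    # Paragraphs with no citation marker at all.
    uncited = [i for i, p in enumerate(paragraphs) if not CITE.search(p)]
    return {
        "unknown_ids": unknown,
        "uncited_paragraphs": uncited,
        "passed": not unknown and not uncited,
    }
```

A check like this catches citation drift cheaply; it does not confirm that the cited chunk supports the claim, which is what the embedding-based consistency layer is for.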
Research · 03 / 04

Federated Transfer Learning for Monkeypox Diagnosis

Privacy-preserving federated ML framework for medical image classification.

A federated learning system across distributed nodes with differential-privacy guarantees (ε-DP), enabling collaborative model training on sensitive medical data without raw data sharing.

Architecture
  • └─Distributed nodes train locally; only updates exchanged
  • └─Differential privacy noise added to gradients (ε-DP)
  • └─Transfer learning from pretrained vision backbones
  • └─Evaluated on constrained real-world medical data
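
The first two bullets (local training with only updates exchanged, and noise added to gradients) follow a clip-then-noise pattern that can be sketched in plain Python. The sketch assumes Gaussian noise and simple unweighted averaging; neither is confirmed by the source, and calibrating `noise_std` to a target privacy budget requires privacy accounting that is not shown here.

```python
import math
import random


def clip(update, max_norm):
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def privatize(update, max_norm, noise_std, rng):
    """Run on each node: clip the local update, then add Gaussian noise."""
    return [x + rng.gauss(0.0, noise_std) for x in clip(update, max_norm)]


def federated_average(updates):
    """Run on the server: coordinate-wise mean of the received updates."""
    count = len(updates)
    return [sum(column) / count for column in zip(*updates)]
```

Clipping bounds how much any single node can move the average, which is what makes the added noise meaningful; the raw training data never leaves the node.
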
90%
Classification accuracy
ε-DP
Privacy guarantee
PyTorch · Federated Learning · Differential Privacy · Medical Imaging
Professional Case Study · 04 / 04

Financial Risk & Fraud ML Platform

Production data + ML platform powering risk decisions in financial services.

End-to-end platform delivered at Capgemini: distributed ingestion, modular ML pipeline, Airflow-orchestrated retraining and evaluation gating. Closed-source — described here from professional experience.

Architecture
  • └─Spark on EMR for distributed ingestion of compressed financial data
  • └─Schema validation + dead-letter queues + partitioned S3 writes
  • └─Modular ML pipeline: features → training → thresholds → batch inference
  • └─Airflow DAGs gate deployment behind evaluation metrics
  • └─Versioned CI/CD for model artifacts
~200 GB
Data processed
~60%
Deploy cycle cut
Spark · Airflow · AWS EMR / S3 / EC2 · Python · Docker · CI/CD
Closed source · described from professional experience
06 Publications

Research published, under review, and submitted.

A Federated Transfer Learning Framework with Differential Privacy for Secure Monkeypox Diagnosis · IEEE SCEECS · 2025 · Published
G²HAN: Geometry-Guided Hierarchical Attention Network for Insect Sound Classification · Scientific Reports (Nature) · 2026 · Published
Hybrid Representation Learning for Correlation-Guided Insect Acoustic Phenotyping · Scientific Reports · 2026 · Under Review
TriSpectralNet: A Deep Learning-Based Acoustic Monitoring of Insects in Agricultural Environments · Computers and Electronics in Agriculture · 2026 · Under Review
07 Technical Stack

The actual tools — not the buzzword cloud.

Programming · 01
Python · SQL · TypeScript (light)
Backend & Systems · 02
REST APIs · Microservices · Async Processing · Event-Driven Architecture · Distributed Systems · Fault Tolerance
Data Engineering · 03
Apache Spark · Apache Airflow · Kafka · ETL / ELT · Schema Validation · Batch & Stream Processing
ML / AI · 04
Scikit-learn · PyTorch · Pandas / NumPy · Feature Engineering · Model Evaluation · Threshold Optimization
Generative AI · 05
RAG · Multi-Agent Systems · GPT-4 / GPT-3.5 · Prompt Engineering · Vector Stores · Hybrid Retrieval
Cloud & MLOps · 06
AWS S3 / EC2 / EMR / Lambda · Docker · CI/CD · Snowflake · Model Versioning
Databases · 07
PostgreSQL · DynamoDB · MongoDB
Visualization · 08
Tableau · Power BI
08 Credentials

Achievements and certifications.

Achievement

Smart India Hackathon 2024

National Runner-Up

Led a team of six to build a publication data analytics solution — improved data processing efficiency by 25%.

Achievement

Capgemini Top Performer

Two consecutive quarters

Recognized for delivering production-grade data and ML systems across financial services use cases.

Achievement

Promoted twice in 2 years

Capgemini

Recognition for engineering depth and consistent on-call delivery in financial services.

Achievement

CodeChef Rank < 50 (×2)

Competitive Programming

Achieved a sub-50 global rank in two contests.

Achievement

LeetCode Rating 1875

Active practice

Consistent DSA practice across topics; profile public.

Certifications
Oracle Certified Data Science Professional
Oracle · OCI-DS
Oracle Certified Generative AI Professional
Oracle · OCI-GENAI
Azure Fundamentals
Microsoft · AZ-900
Deep Learning Specialization
DeepLearning.AI · COURSERA
Databases & SQL for Data Science with Python
IBM · IBM-SQL
09 Education
Aug 2024 — Apr 2026

M.Tech in Data Science

Indian Institute of Information Technology, Bhopal

GPA 9.12 / 10 · Ongoing

Jul 2016 — May 2020

B.E. in Computer Science

Savitribai Phule Pune University

Engineering fundamentals + first SWE internship.

10 Contact

Best way to reach me — for roles, collaboration, or interesting problems.

RESUME.PDF · CURRENT

One-page resume detailing my experience in distributed data ingestion, ML pipelines, MLOps orchestration, and GenAI systems — with metrics and stack.

CURRENTLY

Open to SDE, ML, Data & AI Engineering roles. Based in Bhopal, India.
