01 Index · Suyash Patil

Suyash Pradeep Patil

Software Engineer — Data & AI Systems

I build data-intensive backend systems, ML platforms, MLOps workflows, and GenAI applications that move beyond notebooks into production-grade engineering.

3.5+ Years Experience · LeetCode 1875 · M.Tech Data Science · Ex-Capgemini


SYSTEM_INFO · ONLINE

LOCATION: Bhopal, India
ROLE: SDE / ML / Data & AI
EXPERIENCE: 3.5+ years
EDUCATION: M.Tech, IIIT Bhopal
STATUS: Open to roles
DSA: LeetCode 1875

200 GB Processed

60% Deploy ↓

1875 LeetCode

02 Positioning

I work where software engineering meets data systems and applied ML.

Three and a half years in Capgemini's financial services practice shaped how I think: pipelines that don't drop records, models that ship behind evaluation gates, and GenAI that runs on a budget.

The result is a profile that sits comfortably in software engineering, data engineering, ML systems, and GenAI — without pretending to be a researcher or a generic full-stack dev.

Software Engineering

REST APIs, microservices, async processing, event-driven workflows.

Data Engineering

Distributed Spark ingestion, Airflow DAGs, schema validation, batch & stream.

ML Systems

Feature pipelines, training, threshold tuning, batch inference, evaluation gates.

GenAI / RAG

Multi-agent retrieval, hybrid search, prompt + cost optimization on GPT-4/3.5.

03 Impact

Selected metrics from production work and research projects.

~200 GB

Financial Data Processed

Compressed, distributed Spark ingestion on AWS EMR

~60%

Deployment Cycle Reduced

Airflow DAGs for retraining, evaluation gating, versioned CI/CD

30–40%

Manual Effort Cut

GPT-4/3.5 backend for financial document extraction

35–50%

Inference Cost Reduced

Hybrid retrieval, caching, parallel agents in RAG pipeline

30–40%

Factual Accuracy Gain

Agent validation loops + dense/sparse retrieval

90%

Classification Accuracy

Federated learning with differential privacy on medical data

1875

LeetCode Rating

Active problem-solving across DSA topics

2 × 2

Promotions / Top Quarters

Two promotions in two years; top performer two consecutive quarters

// Numbers reflect personal contributions; some figures are given as ranges to avoid overclaiming.

04 Experience

Production engineering across financial services.

Financial Services

Associate Consultant — Data Science & Machine Learning

Capgemini · Mumbai, India · Hybrid

May 2021 — Sep 2024

Distributed ingestion service

Built a fault-tolerant ingestion service in Python and Apache Spark on AWS EMR processing ~200 GB of compressed financial data, with schema validation, dead-letter queues, and partitioned writes to S3.
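The validate-or-dead-letter step mentioned above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the production Spark job: the fields in `SCHEMA`, the `route_records` helper, and the in-memory dead-letter list are hypothetical stand-ins chosen for the example.

```python
from typing import Any

# Hypothetical schema: field name -> required Python type (an assumption for illustration).
SCHEMA: dict[str, type] = {"txn_id": str, "amount": float, "currency": str}


def validate(record: dict[str, Any]) -> list[str]:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    for field, expected in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors


def route_records(records):
    """Split records into valid rows and dead-letter entries instead of dropping bad rows."""
    valid, dead_letter = [], []
    for record in records:
        errors = validate(record)
        if errors:
            # Keep the original record next to the reasons so it can be replayed later.
            dead_letter.append({"record": record, "errors": errors})
        else:
            valid.append(record)
    return valid, dead_letter


batch = [
    {"txn_id": "t1", "amount": 10.5, "currency": "INR"},
    {"txn_id": "t2", "amount": "oops", "currency": "INR"},
    {"amount": 3.0, "currency": "USD"},
]
valid, dead_letter = route_records(batch)
print(len(valid), len(dead_letter))
```

The point of the pattern is that a malformed record never fails the whole batch and never disappears silently: it is set aside together with the reason it was rejected.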

Risk & fraud ML pipeline

Built modular components spanning feature engineering, training, threshold optimization, batch inference, and evaluation — improving AUC/F1 and cutting production false negatives.

MLOps orchestration

Orchestrated Airflow workflows on AWS (S3, EC2, EMR) automating DAGs for data validation, retraining, evaluation gating, and versioned CI/CD — reducing deployment cycle by ~60%.

GenAI document processing

Engineered a GPT-4/GPT-3.5 backend for extraction and summarization of financial documents with prompt engineering, caching, and batching — cutting manual effort by 30–40%.

Recognition

Top performer for 2 consecutive quarters; promoted twice within 2 years for delivering production-grade data and ML systems.

Python · Spark · Airflow · AWS EMR · S3 · GPT-4 · PostgreSQL · Docker

Early Career

Data Science Intern

Capgemini · Pune, India · On-site

Jul 2019 — Sep 2019

Automated ETL pipeline

Built an automated ETL pipeline for extraction and transformation, improving workflow efficiency by ~40% and reducing manual intervention.

Analytics dashboards

Developed interactive analytics dashboards over processed datasets, surfacing operational metrics that supported leadership decisions and contributed to a ~35% reduction in operational expenditure.

Python · ETL · Dashboards

05 Featured Projects

Real systems shipped — open source, research, and production case studies.

Open Project · 01 / 04

FiFantasy — End-to-end fantasy football platform for the FIFA World Cup 2026

End-to-end deployed: data engineering, ML rating model, realtime backend, system-design invariants, multi-project ops — solo, every layer.

Solo build, deployed end to end. I owned every layer: ingestion connectors, entity matching, layered rating model, simulator, realtime scoring engine, Next.js app, and ops. The interesting decisions live as much in the migrations folder and region pinning as in the model layers. The project mirrors patterns I work with professionally (idempotent pipelines, schema-as-invariant, regional deploy tuning) on a stack I own from schema to UI to production.

Architecture

└─[Data Engineering] Idempotent ETL across 4 source connectors (REST · CSV · Kaggle · JSON) into Postgres via 22 numbered migrations, with an ingestion_runs audit table and rate-limit-aware backoff so reruns are safe by construction
└─[ML] 4-layer position-bucketed rating model: baseline + age curve → market-value z-score → Gemini LLM augmentation → international pedigree; blend weights driven by entity-match confidence so the model degrades gracefully on missing data
└─[System Design] Partial unique indexes (UNIQUE … WHERE status = 'pending') and pure idempotent scoring functions enforce auction, trade, and scoring invariants in the database — not in app code that can drift
└─[Realtime / Performance] Supabase logical replication → channels; Tokyo → Mumbai DB region migration measured ~100ms p95 improvement (~250ms → ~150ms cross-continent), executed via a two-phase migration script kept in the repo
└─[DevOps] Two Vercel projects (private league + public guest demo) deployed from one branch via env-driven SITE_MODE flag; pinned bom1 server region to co-locate with Mumbai DB (~5ms server↔DB vs ~120ms pre-pin)
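The partial unique index from the System Design item can be shown in a few lines. This is a minimal sketch that uses SQLite (via Python's standard library) as a stand-in for Postgres; the `trade_offers` table, its columns, and the `open_offer` helper are hypothetical names for illustration, not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE trade_offers (
        id INTEGER PRIMARY KEY,
        player_id INTEGER NOT NULL,
        status TEXT NOT NULL
    )
    """
)
# Partial unique index: uniqueness is enforced only among rows that are still pending.
conn.execute(
    """
    CREATE UNIQUE INDEX one_pending_offer_per_player
    ON trade_offers (player_id)
    WHERE status = 'pending'
    """
)


def open_offer(player_id: int) -> bool:
    """Try to open a pending offer; return False if one is already pending."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO trade_offers (player_id, status) VALUES (?, 'pending')",
                (player_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = open_offer(7)       # accepted: no pending offer for player 7 yet
duplicate = open_offer(7)   # rejected by the index, not by application code
with conn:
    conn.execute("UPDATE trade_offers SET status = 'accepted' WHERE player_id = 7")
reopened = open_offer(7)    # accepted again: the earlier row is no longer pending
print(first, duplicate, reopened)
```

Because the rule lives in the index, any code path that writes to the table is subject to it, which is the sense in which the invariant cannot drift with application code.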

~100ms

p95 latency cut via region migration

47k → 94%

Players matched, high/med confidence

Distributed Systems · ETL Pipelines · Statistical Modeling · LLM Integration · Cloud Deployment · PostgreSQL · TypeScript

Demo

Open Project · 02 / 04

Production RAG system that generates cited, hallucination-checked articles from your own documents

End-to-end deployed RAG: heading-aware ingestion, hybrid retrieval with ML reranker, multi-agent generation, four-layer hallucination defense, Dockerized full stack.

Solo build of a production-grade RAG pipeline that grounds every paragraph in user-uploaded documents. Owned the full stack — FastAPI backend, Next.js frontend, FAISS vector store, local sentence-transformers embeddings, a scikit-learn reranker, a four-agent orchestrator, and Docker Compose deployment with published GHCR images. Designed against the failure modes RAG systems actually hit in production (bad chunking, lost-in-the-middle context, citation drift, hallucination at chunk boundaries) and documented each design decision against alternatives.

Architecture

└─[Data Engineering] Heading-aware semantic chunking (512-word windows, 64-word overlap) with Unicode NFC normalization, ligature expansion, and SHA-256 source IDs for deduplication across PDF / Markdown / TXT inputs
└─[ML] Local sentence-transformers MiniLM-L6-v2 embeddings (384-dim, ~14.5k tok/s on CPU) indexed in FAISS IndexFlatIP wrapped in IndexIDMap2: exact inner-product search (cosine similarity when embeddings are L2-normalized), zero external embedding cost
└─[ML] 6-feature scikit-learn gradient-boosted reranker (cosine, BM25 keyword overlap, query coverage, heading match, length, position) rescores FAISS top-8 → top-4 before generation, so retrieval failures don't poison the writer
└─[System Design] Four-agent pipeline (Planner → Retriever → Writer → Critic) with a shared PipelineContext object; Writer is prompt-constrained to source-only generation and inserts [CITE:chunk_id] markers that Critic verifies
└─[ML / Evaluation] 4-layer QA defense (citation-based grounding, embedding-based consistency, Flesch readability, retrieved-chunk coverage) combined into a pass/fail confidence score that gates output
└─[DevOps] Dockerized full stack with a single docker-compose up; production profile adds nginx reverse proxy + bearer-auth gating; GHCR-published images and a pytest suite covering ingestion, vector store, and orchestrator
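The windowing and source-ID ideas from the Data Engineering item can be sketched as follows. This is a simplified illustration that covers only the 512-word window, the 64-word overlap, and SHA-256 content IDs; it ignores heading boundaries, Unicode normalization, and ligature expansion, and the function names `chunk_words` and `source_id` are assumptions made for the example.

```python
import hashlib


def chunk_words(text: str, window: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split text into word windows where consecutive windows share `overlap` words."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + window])
        if start + window >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def source_id(raw: bytes) -> str:
    """Content hash used as a stable ID, so identical bytes map to the same source."""
    return hashlib.sha256(raw).hexdigest()


doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(doc)
print([len(c) for c in chunks])
print(chunks[0][-64:] == chunks[1][:64])
print(source_id(doc.encode("utf-8")))
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, which is one common way to reduce retrieval misses at chunk edges.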

top-8 → 4

FAISS + ML rerank pipeline

4-layer

Hallucination defense

RAG Systems · Multi-Agent Orchestration · Vector Search · ML Reranking · FastAPI · Docker · Python

Demo

Research · 03 / 04

Federated Transfer Learning for Monkeypox Diagnosis

Privacy-preserving federated ML framework for medical image classification.

A federated learning system across distributed nodes with differential-privacy guarantees (ε-DP), enabling collaborative model training on sensitive medical data without raw data sharing.

Architecture

└─Distributed nodes train locally; only updates exchanged
└─Differential privacy noise added to gradients (ε-DP)
└─Transfer learning from pretrained vision backbones
└─Evaluated on constrained real-world medical data
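The first two items (local training with only updates exchanged, and noise added before sharing) can be illustrated with a toy round of federated averaging. This is a sketch under assumptions, not the published framework: it uses L2 clipping plus Gaussian noise, the `max_norm` and `noise_std` values are arbitrary, and no calibration of the noise to a privacy budget ε is attempted here.

```python
import math
import random


def clip_update(update: list[float], max_norm: float) -> list[float]:
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def privatize(update, max_norm, noise_std, rng):
    """Clip a node's update, then add Gaussian noise to every coordinate."""
    clipped = clip_update(update, max_norm)
    return [x + rng.gauss(0.0, noise_std) for x in clipped]


def federated_average(updates: list[list[float]]) -> list[float]:
    """Server step: average the per-node updates coordinate by coordinate."""
    n = len(updates)
    return [sum(col) / n for col in zip(*updates)]


rng = random.Random(0)  # seeded only to make the example repeatable
node_updates = [[3.0, 4.0], [0.3, 0.4], [-0.6, 0.8]]  # made-up local updates
noisy = [privatize(u, max_norm=1.0, noise_std=0.1, rng=rng) for u in node_updates]
print(federated_average(noisy))
```

Clipping bounds how much any single node can move the average, and the noise is applied on the node before anything leaves it, so the server only ever sees perturbed updates and never the raw data.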

90%

Classification accuracy

ε-DP

Privacy guarantee

PyTorch · Federated Learning · Differential Privacy · Medical Imaging

Paper

Professional Case Study · 04 / 04

Financial Risk & Fraud ML Platform

Production data + ML platform powering risk decisions in financial services.

End-to-end platform delivered at Capgemini: distributed ingestion, modular ML pipeline, Airflow-orchestrated retraining and evaluation gating. Closed-source — described here from professional experience.

Architecture

└─Spark on EMR for distributed ingestion of compressed financial data
└─Schema validation + dead-letter queues + partitioned S3 writes
└─Modular ML pipeline: features → training → thresholds → batch inference
└─Airflow DAGs gate deployment behind evaluation metrics
└─Versioned CI/CD for model artifacts
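The evaluation-gating idea in the list above can be sketched as a plain function. This is a hypothetical illustration, not the closed-source implementation: the metric names, the floor values, the `max_regression` tolerance, and the `evaluation_gate` helper are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    promote: bool
    reasons: list[str]


def evaluation_gate(candidate, production, floors, max_regression=0.01):
    """Decide whether a retrained model may replace the production model.

    candidate / production: metric name -> value (higher is better).
    floors: metric name -> minimum acceptable absolute value.
    max_regression: largest allowed drop versus production on any shared metric.
    """
    reasons = []
    for metric, floor in floors.items():
        value = candidate.get(metric)
        if value is None or value < floor:
            reasons.append(f"{metric} below floor {floor}")
    for metric, prod_value in production.items():
        value = candidate.get(metric)
        if value is not None and value < prod_value - max_regression:
            reasons.append(f"{metric} regressed vs production")
    return GateResult(promote=not reasons, reasons=reasons)


floors = {"auc": 0.80, "f1": 0.60}      # hypothetical thresholds
production = {"auc": 0.86, "f1": 0.66}  # hypothetical current metrics
good = evaluation_gate({"auc": 0.88, "f1": 0.67}, production, floors)
bad = evaluation_gate({"auc": 0.83, "f1": 0.70}, production, floors)
print(good.promote, bad.promote, bad.reasons)
```

In an Airflow DAG, a function of this shape could plausibly sit behind a short-circuit or branching task so that the deployment tasks run only when `promote` is true; how the original DAGs wired this is not described here.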

~200 GB

Data processed

~60%

Deploy cycle cut

Spark · Airflow · AWS EMR / S3 / EC2 · Python · Docker · CI/CD

Closed source · described from professional experience

06 Publications

Research published, under review, and submitted.

Title · Venue · Year · Status

A Federated Transfer Learning Framework with Differential Privacy for Secure Monkeypox Diagnosis · IEEE SCEECS · 2025 · Published

G²HAN: Geometry-Guided Hierarchical Attention Network for Insect Sound Classification · Scientific Reports (Nature) · 2026 · Published

Hybrid Representation Learning for Correlation-Guided Insect Acoustic Phenotyping · Scientific Reports · 2026 · Under Review

TriSpectralNet: A Deep Learning-Based Acoustic Monitoring of Insects in Agricultural Environments · Computers and Electronics in Agriculture · 2026 · Under Review

07 Technical Stack

The actual tools — not the buzzword cloud.

Programming · 01

Python · SQL · TypeScript (light)

Backend & Systems · 02

REST APIs · Microservices · Async Processing · Event-Driven Architecture · Distributed Systems · Fault Tolerance

Data Engineering · 03

Apache Spark · Apache Airflow · Kafka · ETL / ELT · Schema Validation · Batch & Stream Processing

ML / AI · 04

Scikit-learn · PyTorch · Pandas / NumPy · Feature Engineering · Model Evaluation · Threshold Optimization

Generative AI · 05

RAG · Multi-Agent Systems · GPT-4 / GPT-3.5 · Prompt Engineering · Vector Stores · Hybrid Retrieval

Cloud & MLOps · 06

AWS S3 / EC2 / EMR / Lambda · Docker · CI/CD · Snowflake · Model Versioning

Databases · 07

PostgreSQL · DynamoDB · MongoDB

Visualization · 08

Tableau · Power BI

08 Credentials

Achievements and certifications.

Achievement

Smart India Hackathon 2024

National Runner-Up

Led a team of six to build a publication data analytics solution — improved data processing efficiency by 25%.

Achievement

Capgemini Top Performer

Two consecutive quarters

Recognized for delivering production-grade data and ML systems across financial services use cases.

Achievement

Promoted twice in 2 years

Capgemini

Recognition for engineering depth and consistent on-call delivery in financial services.

Achievement

CodeChef Rank < 50 (×2)

Competitive Programming

Achieved a sub-50 global rank in two contests.

Achievement

LeetCode Rating 1875

Active practice

Consistent DSA practice across topics; profile public.

Certifications

Oracle Certified Data Science Professional

Oracle · OCI-DS

Oracle Certified Generative AI Professional

Oracle · OCI-GENAI

Azure Fundamentals

Microsoft · AZ-900

Deep Learning Specialization

DeepLearning.AI · Coursera

Databases & SQL for Data Science with Python

IBM · IBM-SQL

09 Education

Aug 2024 — Apr 2026

M.Tech in Data Science

Indian Institute of Information Technology, Bhopal

GPA 9.12 / 10 · Ongoing

Jul 2016 — May 2020

B.E. in Computer Science

Savitribai Phule Pune University

Engineering fundamentals + first SWE internship.

10 Contact

Best way to reach me — for roles, collaboration, or interesting problems.

rishi.suyash01@gmail.com

RESUME.PDF · CURRENT

One-page resume detailing my experience in distributed data ingestion, ML pipelines, MLOps orchestration, and GenAI systems — with metrics and stack.

Download Resume

CURRENTLY

Open to SDE, ML, Data & AI Engineering roles. Based in Bhopal, India.

DROP_A_NOTE // takes 20 seconds

OPEN