Software Engineer & Data Architect

Abdul Sohail Ahmed


Building scalable systems that process 10M+ daily transactions. Architecting AI pipelines over 52 GB knowledge bases. From Walmart's distributed microservices to LangChain RAG systems — I engineer at the intersection of data and intelligence.

Spring Boot · Apache Spark · LangChain/RAG · Snowflake · Azure Databricks · Kubernetes · GPT-4o · Microservices · Cassandra
3+
Years Experience
10M+
Daily Transactions
3.88
SJSU GPA

Experience

Oct 2025
Present
Walmart
Tracy, CA
Software Engineer
  • Architecting and developing production-grade REST APIs using J2EE and Spring Boot under a Service-Oriented Architecture (SOA) model, powering distributed microservices that collectively handle 10M+ daily transactions with strict SLA and uptime requirements.
  • Designing event-driven, asynchronous microservices using messaging patterns (publish/subscribe) that decouple service dependencies, reduce latency bottlenecks, and improve system resilience across Walmart's supply chain and fulfillment infrastructure.
  • Working with Apache Cassandra for high-throughput distributed data storage — modeling wide-column schemas optimized for read-heavy workloads and configuring replication strategies to ensure fault tolerance across data centers.
  • Contributing to a CI/CD pipeline using version-controlled code reviews, automated build checks, and integration test suites, ensuring reliable deployment cadence across staging and production environments.
  • Implementing JUnit and Mockito-based unit test suites, raising overall code coverage by 5% and introducing test patterns that the team adopted to standardize coverage expectations across new service modules.
  • Collaborating cross-functionally with product, QA, and platform engineering teams to define API contracts and service boundaries, applying domain-driven design principles to ensure microservices remain loosely coupled and independently deployable.
  • Participating in regular architecture reviews and sprint planning, contributing to decisions around service decomposition, database schema design, and API versioning strategies that affect millions of end-user transactions.
Spring Boot · J2EE · Cassandra · Microservices · SOA · REST APIs · JUnit · Mockito · CI/CD
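The publish/subscribe decoupling described above can be sketched in miniature. This is a toy in-process Python stand-in for a real message broker (topic names and payloads are illustrative, not Walmart's): publishers emit events without knowing which services consume them, so producers and consumers stay independently deployable.

```python
from collections import defaultdict
from typing import Callable

class Broker:
    """Minimal in-process publish/subscribe hub.

    Stands in for a real broker: a publisher emits an event to a topic
    and every subscribed handler receives it, with no direct coupling
    between the publishing and consuming services.
    """
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        # Deliver the event to every registered handler; return delivery count.
        for handler in self._subscribers[topic]:
            handler(event)
        return len(self._subscribers[topic])

broker = Broker()
received: list[dict] = []
broker.subscribe("order.created", received.append)   # e.g. a fulfillment service
broker.subscribe("order.created", lambda e: None)    # e.g. an analytics service
delivered = broker.publish("order.created", {"order_id": 42, "sku": "A-100"})
```

Adding a third consumer later requires only another `subscribe` call; the publisher is untouched, which is the resilience property the bullet points describe.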
Apr 2025
Oct 2025
Flix
San Jose, CA
Data Analyst
  • Designed and engineered an end-to-end ETL pipeline ingesting HR data from ADP and Workday, transforming and loading it into Snowflake, then surfacing it via Microsoft Fabric — automating workflows previously handled manually and cutting the associated effort by 11.32%.
  • Leveraged PySpark in Azure Databricks to handle large-scale data transformation, applying distributed processing to multi-million-row HR datasets while maintaining sub-minute execution times on scheduled batch runs.
  • Integrated Anaplan into the financial planning workflow, enabling finance teams to consume clean, pre-validated HR cost data for headcount planning and budget forecasting with significantly reduced manual reconciliation overhead.
  • Optimized 40+ SQL transformation jobs in Snowflake and Azure Synapse Analytics using GitHub Copilot-assisted query tuning — rewriting inefficient nested subqueries, replacing correlated subqueries with CTEs and window functions, and improving overall data quality scores by 21%.
  • Built and maintained automated Power BI dashboards that delivered freshly processed HR and financial metrics to stakeholders on a schedule, eliminating 6+ hours of manual weekly reporting and improving data-driven decision velocity.
  • Defined data quality validation rules and monitoring checks within the pipeline, catching upstream schema drift and missing values before they propagated into reporting layers — significantly reducing data incidents and stakeholder escalations.
  • Collaborated with HR Business Partners and Finance leads to translate business requirements into data models, ensuring KPIs like attrition rate, headcount trends, and cost-per-hire were accurately captured and updated in near-real-time.
Snowflake · PySpark · Azure Databricks · Power BI · Anaplan · Microsoft Fabric · Azure Synapse · ADP · Workday
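The data quality validation described above (catching schema drift and missing values before they reach reporting) can be sketched as plain Python; column names and rules are illustrative, and the real checks ran in PySpark rather than row-by-row.

```python
def validate_batch(rows, expected_columns, required):
    """Flag schema drift and missing required values before rows
    propagate into the reporting layer. Returns (row_index, issue)
    pairs for each problem found."""
    issues = []
    for i, row in enumerate(rows):
        drift = set(row) - set(expected_columns)
        if drift:
            issues.append((i, f"unexpected columns: {sorted(drift)}"))
        for col in required:
            if row.get(col) in (None, ""):
                issues.append((i, f"missing value: {col}"))
    return issues

batch = [
    {"emp_id": 1, "dept": "Finance", "salary": 90000},
    {"emp_id": 2, "dept": None, "salary": 85000},            # missing dept
    {"emp_id": 3, "dept": "HR", "salary": 78000, "legacy_code": "X"},  # drift
]
problems = validate_batch(batch,
                          expected_columns={"emp_id", "dept", "salary"},
                          required=["emp_id", "dept"])
```

Running checks like this at ingestion time is what lets bad upstream data fail loudly in the pipeline instead of silently in a stakeholder dashboard.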
May 2024
Nov 2024
StackGen
San Jose, CA
Data Engineer Intern
  • Built a production LangChain RAG-based infrastructure manifest generator supporting multiple LLM backends — Llama 3, Mistral 7B, Gemma, and GPT-4o — allowing the system to dynamically switch models based on latency and cost targets, improving generation accuracy by 12% and product adoption by 8%.
  • Designed and exposed REST API endpoints around the RAG pipeline, enabling front-end and third-party services to invoke manifest generation with structured prompts — containerized using Docker and orchestrated via Kubernetes for auto-scaling under variable load.
  • Architected and maintained scalable ETL data pipelines using Hadoop HDFS for raw storage, Apache Spark (PySpark) for distributed transformation, DBT for modeled data layers, MongoDB as a document store for semi-structured outputs, and Apache Airflow for scheduling and orchestration — collectively improving processing efficiency by 25% and reducing pipeline failures by 15%.
  • Tuned Spark job configurations — including executor memory allocation, shuffle partition optimization, and broadcast join strategies — to handle multi-gigabyte daily data volumes efficiently within cloud resource constraints.
  • Integrated HubSpot CRM and Amplitude analytics APIs to pipe user behavior and engagement telemetry into the internal data platform, enabling product and GTM teams to track feature adoption funnels and reduce customer churn by 10% through data-informed intervention campaigns.
  • Connected OpenAI APIs into the product's intelligence layer for automated summarization and semantic search, accelerating developer decision-making speed by 20% according to internal usage benchmarks.
  • Documented pipeline architecture, data contracts, and model integration patterns in an internal engineering wiki, reducing onboarding time for new engineers and establishing reusable standards for future AI feature development.
LangChain · Apache Spark · Docker · Kubernetes · Airflow · DBT · MongoDB · Hadoop · GPT-4o · HubSpot API
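The dynamic model switching described above (choosing an LLM backend by latency and cost targets) can be sketched as a small router. The cost and latency figures below are made-up placeholders, and the real selection also weighed generation accuracy per manifest type; this shows only the routing idea.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost_per_1k: float    # USD per 1K tokens (illustrative numbers only)
    p50_latency_ms: int

MODELS = [
    Model("llama-3", 0.0004, 900),
    Model("mistral-7b", 0.0002, 700),
    Model("gpt-4o", 0.0050, 1200),
]

def route(max_cost_per_1k: float, max_latency_ms: int) -> Model:
    """Pick a model that fits the caller's cost and latency budget,
    preferring the priciest fit as a rough proxy for capability."""
    candidates = [m for m in MODELS
                  if m.cost_per_1k <= max_cost_per_1k
                  and m.p50_latency_ms <= max_latency_ms]
    if not candidates:
        raise ValueError("no model fits the given budget")
    return max(candidates, key=lambda m: m.cost_per_1k)

choice = route(max_cost_per_1k=0.001, max_latency_ms=1000)
```

A cheap batch job would call `route` with tight cost limits and land on a small open model, while a latency-tolerant, accuracy-critical request could afford GPT-4o.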
Dec 2020
Nov 2022
Deloitte
Hyderabad, IN
Data Analyst
  • Led the integration of Sage Intacct and PeopleSoft ERP financial data into a unified analytics layer, resolving schema inconsistencies, normalizing chart-of-accounts structures, and enabling cross-system reporting for audit engagements across multiple enterprise clients.
  • Single-handedly designed and delivered 80+ interactive dashboards across Tableau, Power BI, Alteryx, and Looker — covering financial statements, risk indicators, audit trail summaries, and compliance KPIs — improving audit quality scores by 11% and reducing manual report preparation time by an estimated 30%.
  • Processed and analyzed large-scale financial datasets using Azure Databricks, applying distributed computing to reconcile transactional records spanning millions of rows across fiscal years, enabling auditors to identify anomalies at a scale previously infeasible with traditional tools.
  • Engineered fraud detection workflows using SQL, Python, SAS, and ACL Analytics — building rule-based and statistical models to flag outliers in journal entries, vendor payments, and expense claims, contributing to a 7.81% improvement in audit process efficiency.
  • Translated audit findings into data-driven strategic recommendations presented to C-suite stakeholders, directly influencing client decisions that generated $100K in advisory revenue growth for Deloitte's risk and financial advisory practice.
  • Worked in cross-functional audit teams of 10–15 professionals, coordinating with client finance departments to extract, validate, and interpret financial data under strict regulatory compliance frameworks including SOX and GAAP standards.
  • Developed reusable Python scripts and SQL templates for common audit procedures — variance analysis, trend detection, and data completeness checks — that were adopted by other team members and reduced per-engagement setup time across the department.
Tableau · Power BI · Databricks · SQL · Python · SAS · Alteryx · Looker · ACL Analytics · Sage Intacct · PeopleSoft
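The rule-based outlier flagging described above can be illustrated with a simple z-score rule on journal-entry amounts. This is a stand-in for the SQL/SAS/ACL rules used on real engagements; the threshold and amounts are invented for the example.

```python
from statistics import mean, stdev

def flag_outliers(amounts, z_threshold=3.0):
    """Flag entries whose amount deviates from the mean by more than
    z_threshold standard deviations. Returns the indices of flagged
    entries; a simple statistical screen for unusual journal entries."""
    mu, sigma = mean(amounts), stdev(amounts)
    return [i for i, a in enumerate(amounts)
            if sigma > 0 and abs(a - mu) / sigma > z_threshold]

# One entry is wildly out of line with the rest of the batch.
entries = [120.0, 98.5, 134.2, 101.7, 99_999.0, 110.3]
suspicious = flag_outliers(entries, z_threshold=1.5)
```

In practice such statistical screens are layered with domain rules (round-number amounts, weekend postings, split transactions just under approval limits) before anything is escalated to an auditor.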
May 2019
Jun 2019
Happiest Minds
Bengaluru, IN
Data Scientist Intern
  • Conducted comprehensive customer churn analysis on large-scale telecom datasets stored in Apache Cassandra DB, writing efficient CQL queries to extract multi-dimensional behavioral features across millions of customer records spanning a 12-month observation window.
  • Built and benchmarked churn prediction models in Python, R, and MATLAB — testing Logistic Regression, Random Forest, Gradient Boosting, and SVM classifiers — ultimately selecting an ensemble approach that reduced customer churn by 12% and increased targeted engagement by 15%.
  • Applied rigorous feature engineering techniques including RFM (Recency, Frequency, Monetary) scoring, behavioral cohort segmentation, and temporal feature extraction to improve signal quality, ultimately achieving a predictive model accuracy of 98%.
  • Designed and implemented automated data ingestion and preprocessing pipelines that standardized raw data from multiple source systems, handled missing values and outliers, and kept model inputs fresh on a weekly refresh cadence without manual intervention.
  • Built interactive Tableau dashboards to visualize churn risk scores, customer segment breakdowns, and retention KPIs — enabling the marketing team to prioritize outreach to high-risk segments and measure the downstream impact of retention campaigns in near-real-time.
  • Presented model results and business implications to senior data science and product stakeholders, translating statistical outputs into actionable retention strategies and gaining early experience communicating technical findings to non-technical audiences.
Python · R · MATLAB · Cassandra · Tableau · Random Forest · Gradient Boosting · RFM Analysis
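The RFM (Recency, Frequency, Monetary) scoring mentioned above can be sketched as a toy scorer: bucket each dimension into 1–5 and sum. The bucket edges below are invented for illustration; the actual pipeline derived quintile cut-offs from the full customer base.

```python
def rfm_score(recency_days: int, frequency: int, monetary: float) -> int:
    """Toy RFM scorer: each dimension maps to a 1-5 bucket and the
    buckets are summed (3 = coldest, 15 = best customer)."""
    def bucket(value: float, edges: list[float], reverse: bool = False) -> int:
        score = 1 + sum(value > e for e in edges)   # 1..5
        # Recency is inverted: a *recent* purchase (small value) is better.
        return 6 - score if reverse else score
    r = bucket(recency_days, [7, 30, 90, 180], reverse=True)
    f = bucket(frequency, [1, 3, 10, 25])
    m = bucket(monetary, [50, 200, 1000, 5000])
    return r + f + m

engaged = rfm_score(recency_days=5, frequency=12, monetary=800.0)
lapsed = rfm_score(recency_days=365, frequency=1, monetary=10.0)
```

Scores like these became model features and segmentation keys, so a churn model sees "recently active, frequent, high-spend" as a single compact signal.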

Technical Projects

01 · AI/NLP
LexLLM — AI Legal Assistant

Multi-model legal AI leveraging GPT-4o, Gemini 1.5, Llama 3.1, and Mixtral 8x7B with RAG pipelines over a 52 GB knowledge base of 10,236 legal documents. Validated via OpenAI Evals and human review.

+14.3% text accuracy · −9.5% perplexity
GPT-4o · Llama 3.1 · RAG · LangChain · Mixtral
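The retrieval step of a RAG pipeline like LexLLM's can be sketched in a few lines: rank documents by relevance to the query and hand the top hits to the LLM as context. The real system used embedding search over the 10,236-document corpus; the token-overlap scorer and sample documents below are only a stand-in for the mechanism.

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Bare-bones RAG retrieval: score each document by how many
    query tokens it shares, then return the top k as LLM context."""
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

docs = [
    "statute of limitations for contract disputes",
    "patent filing procedure overview",
    "breach of contract remedies and damages",
]
context = retrieve("contract breach damages", docs, k=2)
```

Grounding generation in retrieved passages is what lets a legal assistant cite source documents instead of hallucinating doctrine, which is why accuracy and perplexity were the validation metrics.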
02 · ML/Finance
P2P Lending Risk Prediction

Ensemble model (stacked XGBoost + Random Forest) with SMOTE handling a 1:10 class imbalance for peer-to-peer lending default prediction. Achieved industry-leading accuracy with a LightGBM pipeline.

99.23% accuracy · +10.25% risk precision
XGBoost · LightGBM · SMOTE · Scikit-learn
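SMOTE's core move for the 1:10 class imbalance above is to synthesize new minority-class (defaulter) points by interpolating between a real minority sample and a nearby minority neighbor. The project used imbalanced-learn's SMOTE; this pure-Python sketch with made-up 2-D feature vectors shows the mechanism only.

```python
import random

def smote_sample(minority: list[list[float]], n_new: int, seed: int = 0) -> list[list[float]]:
    """Generate n_new synthetic minority points. Each is a random
    interpolation between a minority sample and its nearest minority
    neighbor, so synthetic points stay inside the minority region.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # Nearest *other* minority sample by squared Euclidean distance.
        b = min((p for p in minority if p is not a),
                key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))
        t = rng.random()  # interpolation factor in [0, 1)
        out.append([x + t * (y - x) for x, y in zip(a, b)])
    return out

defaults = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]  # toy defaulter feature vectors
synthetic = smote_sample(defaults, n_new=2)
```

Oversampling only the training folds this way lets the classifier see a balanced class distribution without the leakage that naive duplication or test-set resampling would introduce.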
03 · Analytics/.NET
VTuber Stream Analytics

Full-stack analytics platform with Azure data orchestration and Power BI dashboards. Built .NET website with Google OAuth SSO serving 1,000+ users, boosting content trend visibility and engagement.

+20% decision efficiency · +25% engagement · 1,000+ users
Azure · Power BI · .NET · Google OAuth

Skills

Languages
Java · Python · R · SQL
Backend & APIs
Spring Boot · J2EE · REST APIs · Microservices · SOA
Data Engineering
Apache Spark · Airflow · DBT · Hadoop · ADF · Azure Synapse · Databricks
Databases
Snowflake · PostgreSQL · MySQL · MongoDB · Cassandra · Astra DB
AI/ML & BI
LangChain · TensorFlow · PyTorch · Scikit-learn · Power BI · Tableau · Looker
Proficiency
Python / PySpark · 95%
Java / Spring Boot · 90%
SQL / Snowflake · 92%
Data Engineering (ETL) · 93%
LangChain / RAG / LLMs · 88%
Power BI / Tableau · 87%
Azure / Databricks · 85%
Docker / Kubernetes · 80%

Education

Master's Degree
M.S. in Data Analytics
San Jose State University (SJSU)
San Jose, CA · Jan 2023 – Dec 2024
3.88 / 4.0 GPA
Bachelor's Degree
B.Tech in Computer Science & Engineering
Vellore Institute of Technology (VIT)
Vellore, India · Jun 2017 – Jun 2021
9.34 / 10.0 GPA

Let's build
something great.

Open to full-time Software Engineering and Data Engineering roles. Based in the Bay Area and open to hybrid or remote work. I reply within 24 hours.

Send a Message →