Summary
Experienced Data Science and Engineering professional with 7+ years of hands-on experience designing data-intensive applications using the Hadoop Ecosystem, Big Data Analytics, Cloud Data Engineering (AWS, GCP), Data Visualization, Data Warehousing, Reporting, Data Pipelines, and Data Quality solutions. Hands-on expertise with the data engineering stack, including Python and SQL; databases such as MySQL, SQL Server, Snowflake, MongoDB, and Cassandra; and writing ETL jobs. I have also built Statistical and Machine Learning models and have an exceptional understanding of Descriptive and Predictive Analytics. I have worked extensively on Data Preprocessing, Interpreting Complex and Multidimensional Datasets, Database Management, Programming, Problem Solving, Model Deployment and Maintenance, Model Optimization, and Metrics, making an impact across various domains and industries.
Experience
Data Scientist @ XSELL Technologies (Jun 2022 - Present)
Summary: Developed advanced Natural Language Processing Models, Pipelines and Metrics using Unstructured Data for Multiple High-Profile Clients
- Built spaCy Transformer Natural Language Models and Pipelines with 85% Accuracy by implementing custom components.
- Estimated metrics like Precision and Recall that helped with process improvements, enabling $500,000 in business for a major client.
- Worked on automating ML processes to streamline the model development, testing and deployment workflow.
- Fine-tuned ALBERT Transformer models using AWS SageMaker and Hugging Face with 5TB of de-identified transcript data.
- Designed and architected an end-to-end MLOps pipeline, leading to a 60% increase in model operational efficiency.
- Automated the model training pipeline by Containerizing the training code, which reduced model deployment time by 15%.
- Technologies & Concepts Used: Python, SQL, Excel, NLP, Statistics, Machine Learning, AWS, Snowflake, Data Mining, Supervised and Unsupervised Learning, Advanced Models.
Data Scientist @ University of South Florida (Aug 2021 - May 2022)
Summary: Deployed Natural Language Processing Models based on Similarity and Semantic Analysis on Unstructured Data
- Estimated Kickstarter Project consistency at 7% based on project descriptions by extracting Unstructured (text) data using an API.
- Applied text pre-processing techniques to the data and performed Similarity Analysis using Python with NLP techniques like Cosine Similarity.
- Estimated that 36% of Reddit posts in 2020 were about health issues by fetching Unstructured data using Web-Scraping.
- Applied text pre-processing techniques to the data and performed Semantic Analysis using Python with NLP techniques like LDA, TF-IDF, etc.
- Designed a website for an AI Research study and hosted it on GCP, where it could handle over 1,000 daily visits.
- Technologies & Concepts Used: Python, R, NLTK, Gensim, Natural Language Processing, Text Mining, Similarity Analysis.
Data Engineer @ University of South Florida (Jan 2021 - Jul 2021)
Summary: Built GCP Data Pipelines and Forecast Company Growth from Emerging Technologies using Social Media and Financials - GitHub
- Fetched 100GB of SEC 10-K text data using Web-Scraping techniques and piloted ETL packages for extracting YouTube transcripts.
- Prepared a text corpus of SEC business and risk sections and YouTube transcripts for ~70 companies in the IT domain.
- Built Data Pipelines in Airflow on GCP for ETL jobs using a variety of Airflow operators.
- Applied data preprocessing, cleaning, and transformation steps to build a 100+ feature set for modeling across multiple documents.
- Performed Similarity Analysis after applying Topic Modeling techniques like LDA and TF-IDF, which yielded a Highest Correlation of 38%.
- Sourced the emerging technologies from the last 5 years of Gartner Hype Cycle reports.
- Technologies & Concepts Used: GCP, Python, R, NLP, NLTK, Gensim, ETL, Web Scraping, Data Warehouse, Topic Modelling, LDA, TF-IDF, Text Analytics, Similarity Analysis.
Machine Learning Engineer @ Tampa General Hospital (Feb 2020 - Dec 2020)
Summary: Optimized Medical Clinic Operations by Building ML Models and Metrics
- Improved clinic Operating Efficiency by 38% by building a Random Forest model on key features to optimize the patient processing flow.
- Automated model training and publishing using scikit-learn data pipelines.
- Built metrics and visualizations in Tableau and deployed them on AWS, giving top management a high-level operational overview.
- Technologies & Concepts Used: Python, Tableau, Power BI, Excel, Data Visualization.
Junior Data Scientist @ Amazon (Sep 2017 - Jul 2019)
- Developed and maintained PAN-EU Inbound Analytics and Reporting tools that facilitated freight planning and scheduling.
- Reduced overall Freight Delay Risk by 12% by creating Automation Tools and Risk Assessment Metrics using Tableau and Python.
- ALPS Prediction: Predicted labor productivity with 82% accuracy, which led to Freight Optimization across regions in the EU.
- Extracted data from AWS Server and developed SVM and Logistic Regression Models on the freight data.
- Identified 50% more KPIs from queried data using SQL after implementing data cleaning techniques in Python.
- 3PL Deployment: Collaborated with supply chain on the inception and deployment of 3PL (Third-Party Logistics) sites with 92% success.
- Increased total Annual Shipments by 34% over a span of 6 months by developing key processes and metrics through A/B Testing.
- Technologies & Concepts Used: Python, R, SQL, Tableau, Excel, Machine Learning, AWS, Oracle, Data Mining, Supervised Learning, Data Visualization.
Data Engineer @ Mahindra & Mahindra (Aug 2016 - Aug 2017)
Summary: Built AWS data pipelines and Metrics for production forecasting and Optimized process flow to achieve significant cost savings
- Used various AWS services including S3, EC2, AWS Glue, Athena, Redshift, EMR, SNS, SQS, DMS, Kinesis.
- Extracted data from multiple source systems (S3, Redshift, RDS) and created multiple tables/databases in Glue Catalog using Glue Crawlers.
- Used AWS data pipeline for Data Extraction, Transformation and Loading from homogeneous or heterogeneous data sources and built various graphs for business decision-making using Python's Matplotlib library.
- Predicted Rework Rate for exports with 89% Accuracy by deploying a predictive model over a 3-month period of A/B testing.
- Saved over $90,000 in Costs using lean-agile methods and achieved a 15.3% reduction in rollout times by leveraging data.
- Technologies & Concepts Used: AWS, SQL, Tableau, Excel, Machine Learning, Hypothesis Testing, Data Visualization.
Data Analyst @ Mahindra & Mahindra (Jun 2015 - Jul 2016)
Summary: Built tools and methods for Optimizing production process flow
- Created interactive dashboards by analyzing patterns in the daily production metrics.
- Assisted executives with defect reduction using data visualization in Excel, which led to a 9% increase in daily production.
- Technologies & Concepts Used: SQL, Excel, Data Visualization.
Technical Skills
Programming Languages: Python, R, SQL, JavaScript, C#, HTML, CSS, ASP.NET Core MVC, SAS
Databases: Oracle, MySQL, SQL Server, MongoDB, Cassandra, Snowflake, ETL
Big Data & Deployment: Apache Hadoop, Hive, Impala, Spark, Spark SQL, MLlib, Databricks, AWS SageMaker, Docker, TensorFlow Serving
Visualization & Cloud Tools: Google Cloud (GCP), AWS, EC2, S3, Redshift, Microsoft Azure, Tableau, Power BI, Excel, VBA
Domain Experience: IT, E-Commerce, Healthcare & Automotive
Machine Learning: Supervised, Unsupervised Learning, Classification, Regularization, CNN, RNN, Anomaly Detection, K-NN, SVM, Naïve Bayes, Decision Tree, Random Forest, Keras, TensorFlow, Text Mining, Natural Language Processing (NLP)
Statistics: Descriptive and Inferential Statistics, Linear (OLS, GLM), Logistic, Poisson, Hypothesis Testing, ANOVA, Survival Analysis, Mixed Models, Linear-Mixed Effect Models (LMER), A/B Testing, Data Science Pipelines, Time Series, APIs, Excel, Git
Education
Master of Science in Business Analytics and Information Systems (MS-BAIS) @ University of South Florida (Aug 2020 - May 2022)
- Cumulative GPA: 3.9/4.0
- Coursework: Statistical Data Mining, Data Science Programming, Machine Learning, Text Analytics, Big Data, NLP.
Bachelor of Technology in Mechanical Engineering (MECH) @ GITAM University (Jul 2012 - May 2016)
- Cumulative GPA: 3.5/4.0
- Coursework: Mathematics, Probability, Supply Chain Analytics, Management Information Systems, Linear Algebra and Statistics.
Projects
Applied Data Scientist @ University of South Florida (Jan 2021 - May 2022)
Patient Turnup Rate at Clinics Analysis: Forecast Patient Show-Up Rate at Medical Clinics - GitHub
(Data Science Pipelines, Spark SQL, Python, Spark, Spark MLlib, Databricks, Classification, Azure, ML Pipeline)
- Led a team of 3 and developed a Random Forest Classifier to predict No-Shows with 80% Accuracy on patient data and made recommendations on Wait Times and Scheduling to optimize clinics' operational efficiency.
- Leveraged the Azure Databricks analytics platform to create data preprocessing, exploratory data analysis, feature extraction, and modeling pipelines.
Multi-Class Malware Classification: Flagged Malicious Software into Multilevel Categories - GitHub
(Classification, Python, CNN, TensorFlow, Keras, Neural Networks, Unbalanced Data, GCP)
- Led a team of 4 and developed Classification models to categorize files into binary or nine-category bins from Image and Batch data.
- Performed exploratory data analysis and feature extraction, and built Convolutional Neural Network and hyperparameter-tuned XGBoost models that achieved the lowest misclassification rate at 1.4 (closer to 0 is better) and a Test Accuracy of 94%.
Multi-Level House Price Prediction: Matched Customers with Estimated Average House Pricing based on various KPIs - GitHub
(R, Tableau, Python, Multi-Level Regression, Linear Mixed Effect Models (LMER), Statistical Modelling)
- Built a Multilevel Model to predict the average house price based on crucial economic factors like mortgage delinquency, minimum wage, income, etc.
- Recommended changes in Minimum Wage (5% effect) and Subsidies (4% effect) that can be implemented to reduce housing prices.
Let’s Connect
Thanks for stopping by. You can find my portfolio and other projects at the following links.
Email | Portfolio | LinkedIn | GitHub | Twitter