Raj Purohith Arjun

           

Master's in Data Science

Texas A&M University

Department of Computer Science and Engineering

Graduate Student

Institute of Data Science




CURRENT PROJECTS

  • Developing a GenAI pipeline for lesson tagging and semantic search using GPT-4, BERT/T5, Weaviate, Algolia, and Hugging Face to improve lesson discoverability @SubjectToClimate
  • Developing an AI-powered Workplace Investigation Tool by fine-tuning legal LLMs (SaulLM-7B, BERT) and building a RAG+FAISS pipeline for fast document retrieval and actionable HR compliance guidance @Atlas Power Inc, Texas

Projects

Go to my GitHub page to explore my work and collaborate.

AI-Powered Workplace Investigation Tool - Data Science Capstone with Atlas Power Inc, Texas

Developing an AI-powered Workplace Investigation Tool that helps managers conduct workplace investigations with consistency, compliance, and legal defensibility. The project involves fine-tuning legal-domain LLMs such as SaulLM-7B and BERT, and building a Retrieval-Augmented Generation (RAG) pipeline with FAISS for fast and accurate legal document retrieval. I am implementing end-to-end workflows for ingestion, search, and generation, enabling the tool to provide actionable recommendations and guidance for HR compliance scenarios.

InterviewAgent.io: Innovative GenAI-powered Interview Platform

Built an AI-driven mock technical interview platform using Google Gemini, LangGraph, and few-shot learning that simulates realistic, multi-turn coding interviews with function calling, structured outputs, and image/code understanding. Enabled personalized feedback, skill assessment, and scalable preparation through agentic workflows and long-context reasoning.


Optimizing AI Workflows for K–12 Education: Tool Benchmarking & Prototyping

Benchmarked NLP and GenAI tools (Hugging Face Transformers, OpenAI GPT-4, Cohere APIs, Weaviate) to automate lesson tagging, document summarization, and semantic search. Delivered Python-based prototypes that reduced manual workload by 70%, enhanced resource discoverability, and enabled cross-team usability in a nonprofit, low-infrastructure environment.

SmartLink – Graph-Based Recommendation System

Built a hybrid AI system that recommends connections based on shared institutions and social graphs. Used Python, DuckDB, and OpenAI embeddings for structured profile parsing and semantic similarity, while modeling relationships in Neo4j. Orchestrated pipelines with LangChain and deployed an interactive Streamlit UI for real-time recommendations using Qdrant.


Prophetic Sentinel: Anticipatory Maintenance System


Built an anticipatory maintenance system using Python (scikit-learn, TensorFlow), SQL, and Azure, applying machine learning to predict equipment failures in data centers. The model reached 90% accuracy in predicting equipment failures, which reduced downtime and cut maintenance costs by 30%.


AI Video Summarization


Developed a video summarization tool using Mixtral, Whisper, and AWS, integrating language models and tools such as GPT-3, BERT, and FFmpeg for efficient video processing. Set up AWS EC2 instances to run large models, reduce latency, and transcribe audio to text with Whisper. Implemented dynamic quiz generation and a feedback system using Flask, HTML, and JavaScript, storing user data in a database. The project highlighted the strengths and limitations of various language models and their practical application in video summarization and interactive quizzes.


AI-driven Financial Fraud Detection Using Deep Learning and XAI


Developed an AI-based system to detect financial fraud in real time, leveraging deep learning models such as Transformers, CNNs, and GANs for fraud identification and simulation of forged transactions. Used EfficientNet for image-based fraud detection and integrated Explainable AI (XAI) techniques to enhance model transparency and support ethical decision-making. Visualized fraud patterns and risks using Tableau or Seaborn, helping stakeholders make informed, data-driven investment decisions. Achieved 95% accuracy in detecting fraudulent transactions while reducing false positives.


End-to-End MLOps Pipeline for Loan Eligibility Prediction


Built an MLOps pipeline for a Loan Eligibility Prediction model using Python, deployed on Google Cloud Platform (GCP). The pipeline involved creating a Flask API, containerizing it with Docker, and managing source code through Cloud Source Repositories and Git. Automated deployment was handled via Cloud Build, and the model was deployed using Cloud Run. This project demonstrated efficient cloud architecture, leveraging GCP services for scalable and automated machine learning operations.


Bank Customer Churn Prediction


Conducted statistical and multivariate analysis on customer data to identify key churn drivers. Developed predictive models using logistic regression and Random Forest, improving churn prediction accuracy by 85%. Provided actionable insights to stakeholders, enabling retention strategies that reduced churn rates by 20%.


PersonaCraft: Tailored Content Recommender


Engineered a machine-learning content recommendation system that tailors content suggestions to individual users. Personalized recommendations increased user engagement by 25%.


Ensemble Model for Vehicle Tracking and ANPR


Engineered an ensemble model using YOLOv5, YOLOv8, DeepSORT, and EasyOCR for automatic number plate recognition (ANPR) and vehicle tracking, designed to perform well in low-light conditions. The model achieved an F1 score of 0.97 and is complemented by a user-friendly web interface.


Financial Risk Prediction


Built a predictive model to assess credit risk using logistic regression and decision trees, achieving a risk classification accuracy of 92%. Conducted Time Series Analysis to identify trends in loan defaults, improving risk profiling. Developed dashboards in Power BI to visualize risk metrics and trends, enhancing transparency for stakeholders.




SKILLS

  • Python / C++ / SQL / R / Docker / Kubernetes / Apache Spark / Hadoop / AWS (Lambda, S3, Redshift, SageMaker) / GCP (BigQuery, Dataflow) / Power BI / Tableau / Excel
  • TensorFlow / Keras / PyTorch / Scikit-learn / PyCaret / Hugging Face Transformers / spaCy / Pandas / NumPy / Matplotlib / Seaborn / Plotly / ETL & Data Pipeline Automation
  • ML & AI: Supervised & Unsupervised Learning, NLP (BERT, T5, GPT), Deep Learning, Generative AI, Transfer & Reinforcement Learning, XGBoost, ONNX
  • MLOps & Deployment: MLflow, CI/CD, REST APIs, Cloud Deployment (AWS, GCP, Azure)
  • Data Analytics & Visualization: EDA, Data Storytelling, Executive Dashboards, Analytical Thinking
  • Professional Skills: Stakeholder Communication, Cross-functional Collaboration, Agile Methods, Ethical Data Handling
  • Certifications: Microsoft Certified: Azure Data Scientist Associate







  • Professional Experience

    Education

    • 2024-26: Master's in Data Science, Texas A&M University (Expected Graduation: May 2026) - GPA: 3.66/4.0
      • Relevant Coursework: Mathematical Foundations for Data Science, Database & Comp Tools for Big Data, Statistical Foundations for Data Science, Data Mining and Analysis, Natural Language Processing, Deep Learning, Applied Analytics, Software Engineering Workflows
    • 2020-24: Bachelor's in Computer Science (AI & ML), SRM University - CGPA: 3.92/4.0
      • Relevant Coursework: Advanced Programming Practices, Data Structures and Algorithms, Computer Vision, Database Management Systems, Artificial Intelligence, Statistical Machine Learning, Operating Systems
      • Ranked 1st in the Department (Dean’s List) for the years 2023 & 2024
      • Received Best Paper Award for the research paper "Innovation in Vehicle Tracking: Harnessing YOLOv8 and Deep Learning Tools for Automatic Number Plate Detection"

    Internships

    Data & AI Integration Intern - SubjectToClimate, New York (Remote)

    • Engineering a GenAI lesson-tagging pipeline using GPT-4 and fine-tuned T5/BERT classifiers to reduce manual annotation and improve content organization across 1,000+ educational resources.
    • Optimizing semantic search by evaluating and deploying embeddings (Sentence-BERT, OpenAI, Hugging Face variants) with Algolia and Weaviate, enhancing retrieval relevance and educator discoverability.
    • Constructing predictive tagging workflows with NLP feature engineering, SQL-based data extraction, and logistic regression models to improve search precision and automate content classification.
    • Conducting statistical evaluation and A/B testing on AI platforms and educator engagement data, applying cohort modeling, hypothesis testing, and behavioral segmentation to guide data-driven decisions.
    • Deploying and monitoring scalable infrastructure with cloud integration, vector databases, and observability tools (Prometheus, Grafana), ensuring robust API performance and actionable insights while reducing operational costs.

    Graduate Data Science Assistant (TAMIDS) - Texas A&M University, College Station

    • Analyzed 100,000+ soil CO2, N2O, and CH4 flux measurements using automated sensors and trace gas analyzers, ensuring 95% precision in environmental data collection.
    • Developed and optimized ETL pipelines for high-frequency climate and agricultural sensor data using Apache Airflow, Python, and PySpark, reducing data preparation time by 30% and improving data reliability for precision farming applications.
    • Built scalable data infrastructure supporting multi-terabyte satellite and IoT sensor data from 40+ interconnected applications, enabling real-time monitoring of soil health, crop conditions, and climate patterns at an ingestion rate of 50 Hz.
    • Designed efficient database solutions and optimized SQL queries, enhancing data analysis speed by 20% and enabling real-time insights for agricultural yield forecasting and climate risk assessment.
    • Developed 10+ Tableau dashboards for farmers, agronomists, and policymakers to facilitate data-driven decision-making.
    • Applied advanced ensemble modeling techniques (Bagging, Random Forest, Gradient Boosting, AdaBoost) to improve crop yield prediction accuracy and optimize resource allocation in sustainable farming practices.

    Software Engineer (AI & ML) - Entropik, Chennai, India

    • Led the development and optimization of AI-powered consumer research solutions using advanced machine learning models and techniques, driving significant improvements in customer insights and engagement.
    • Designed and implemented machine learning models to analyze and predict consumer behavior using advanced AI techniques, including Emotion AI, Behavior AI, and Predictive AI, resulting in more accurate consumer insights.
    • Utilized Python, Pandas, and NumPy to manage large-scale consumer data sets, processing millions of data points to uncover actionable insights and enhance model performance.
    • Applied supervised learning techniques, including logistic regression and support vector machines (SVM), to predict customer behavior and tailor personalized marketing strategies, improving campaign effectiveness by 25%.
    • Enhanced predictive accuracy through data preprocessing techniques such as normalization, scaling, and outlier detection, ensuring more reliable results in consumer behavior forecasting.
    • Developed and fine-tuned recommendation systems for personalized consumer experiences, leveraging collaborative filtering and content-based methods to improve user engagement and satisfaction by 20%.
    • Employed advanced data visualization tools like Matplotlib and Seaborn to present complex consumer behavior patterns and trends, helping stakeholders easily interpret data and make informed decisions.
    • Led the implementation of AI-powered insights into consumer preferences and behaviors, driving 30% more accurate product recommendations and improving marketing ROI.
    • Utilized SQL and cloud platforms (AWS, GCP) to streamline data pipelines and optimize data storage and retrieval processes, reducing data processing times by 40%.
    • Collaborated with cross-functional teams, including product managers and marketing analysts, to ensure AI models aligned with business goals and consumer expectations.
    • Delivered technical presentations to both technical and non-technical stakeholders, effectively communicating the impact of AI-driven consumer insights and providing actionable recommendations for improving consumer engagement.

    Data Science Analyst - HighRadius, Chennai, India

    • Built and deployed a comprehensive AI-enabled Fintech B2B cloud application, focusing on creating a scalable, full-stack web-based product.
    • Leveraged Python libraries like Pandas, NumPy, and Scikit-learn for in-depth data analysis, as well as JavaScript for interactive data visualization.
    • Conducted extensive data preprocessing and feature engineering, including data cleaning, wrangling, normalization, and scaling, to prepare datasets for predictive modeling.
    • Employed text vectorization techniques to classify user segments for targeted financial services.
    • Led the data analysis and machine learning aspects, implementing classification models such as Gradient Boosting, XGBoost, and Random Forest with extensive parameter tuning.
    • Evaluated model performance using metrics like accuracy, precision, recall, and F1-score to select the most effective model for financial risk prediction.
    • Utilized advanced anomaly detection techniques to identify payment discrepancies, increasing fraud detection capabilities by up to 90%.
    • Conducted A/B testing to analyze transaction patterns and collaborated with marketing teams, leading to targeted campaigns that improved transaction rates.
    • Executed SQL operations, including complex joins and data aggregation, to streamline the ETL process, enhancing data transformation and storage in a centralized database.
    • Set up new database schemas and optimized queries to improve data management and retrieval speed.
    • Developed automated data transformation workflows, converting raw data into formats suitable for analysis. Created an experimental framework for automated data collection and built real-time approval systems, reducing manual intervention and increasing productivity by 30%.
    • Collaborated with cross-functional teams, including UI/UX designers and backend developers, integrating machine learning insights into the application to enhance user experience.
    • Visualized data insights using tools like Tableau and R Markdown, providing detailed exploratory analysis and uncovering key business trends.
    • Identified key areas for procedural improvement through customer data analysis, providing actionable insights that enhanced decision-making and profitability.
    • Applied various clustering techniques to detect underperforming segments, leading to strategic adjustments that boosted overall system efficiency.
    • Maintained a high standard of attention to detail in handling data and building models, ensuring the accuracy and reliability of predictions. Effectively communicated insights through written reports and presentations to stakeholders, facilitating data-driven business decisions.

    Publications and Awards
    • 2025 Selected as a judge for Student Research Week (SRW) 2025 at Texas A&M University, which provides a platform for students across disciplines to showcase their innovative work through oral and poster presentations.
    • 2025 Appointed as a Judge and Mentor for the 2-day TAMU Hack 2025 hackathon, guiding participants, providing technical mentorship, and evaluating 20+ projects based on innovation, execution, and impact.
    • 2024 Won Best Paper Award at the International Conference on Computing Technologies for Sustainable Development-2024 for "Innovation in Vehicle Tracking: Harnessing YOLOv8 and Deep Learning Tools for Automatic Number Plate Detection"
    • 2023 Won Best Paper Award at the National Conference on Technology for the Society’23, held at SRMIST, Chennai, for the research paper “Enhancing ANPR using YOLOv8 and Deep Learning Techniques”
    • 2023 Received an Academic Award for Overall Proficiency Rank-1 in the Computer Science Department for the years 2023 and 2024, SRM University
    • 2023 Ranked in the top 10 out of 2500 participants in Proglint’s Alliance University Computer Vision Hackathon 2023.






    Certifications

    My Profile on LinkedIn

  • Microsoft Certified: Azure Data Scientist Associate (certificate)
  • HackerRank Certified: SQL Advanced Programmer (verify)
  • Python for Data Science, AI & Development, IBM (certificate)
  • Application of Machine Learning in Urban Studies (IIRS & ISRO) (certificate)
  • Neural Networks and Deep Learning by Andrew Ng (verify)
  • Introduction to Machine Learning by Debjani Chakraborty (NPTEL, IIT KGP) (verify)
  • Natural Language Processing by Haimanti Banerji (NPTEL, IIT KGP) (verify)



  • Contact


    Mailing address: Unit 204, The Villas of Cherry Hollow, 503 Cherry Street, College Station, TX 77840

    E-mail: raj2001@tamu.edu or rajpurohitharjun58@gmail.com
    LinkedIn: Reach me at LinkedIn