Work experience

Senior Data Scientist, Wordremit (Zepz), Remote

Jun 2024 - Jan 2025

Real Time Alerting System I developed an alerting system to detect mismatches between partner statements and internal database information, resulting in significant financial benefits and operational improvements. The system consists of a pipeline of SQL queries orchestrated in Airflow via Astronomer, running daily. It detects mismatches, aggregates and categorizes unbalanced transactions, and sends metrics to Grafana for monitoring. Operational leads are notified via Slack and email, and a Mode dashboard provides parameterized reports for investigation by the Data Operations team. This system has yielded substantial savings, with approximately $10 million annually from potential payout losses across over 80 partners.

The implementation of this system involved addressing several challenges. Partner statements were loaded into S3, accommodating multiple formats and evolving content representations over time without prior planning or coordination. Additionally, we managed the migration from legacy systems for two remittance products to a progressively deployed shared service. Using advanced data science techniques, we successfully reconciled transactions between the old and new systems when matches existed. This project demonstrates a robust approach to financial data reconciliation, leveraging modern data engineering and analytics tools to achieve significant cost savings and operational efficiency.
Cash-In-Transit Pipeline Optimization I refactored and improved a pipeline that reconciles the debit capture and payout side of remittance transactions, calculating various revenue types including FX and fee revenue. The enhanced pipeline outputs data via SFTP to the Blackline accounting software, significantly increasing the percentage of fully reconciled transactions. This improvement provided more comprehensive information for treasury, finance, and safeguarding teams to use in their processes.

The refactored solution addressed a persistent issue where data was consistently missing for certain periods due to subsequent data batch capture methods or errors from DAG failures. The new pipeline was designed to be modular and fault-tolerant, making it robust enough to serve as the foundation for our reconciliation system. This approach not only solved immediate problems but also created a scalable and reliable infrastructure for future reconciliation needs.
Led a project to automate metadata changes using Google APIs and Airflow, eliminating cross-functional teams' reliance on Engineering resources for changes like adding new partner accounts to the journaling process and improving the data pipelines readability. This innovation freed up Engineering resources, enabling them to focus on developing a UI for partner statements—a project that had been delayed for months but was quickly completed as a result.
Designed and implemented a non-parametric propensity model that captures the cumulative distribution for the next transaction of a user by using a semi-supervised approach. This directly impacted our pricing strategies and led to better customer retention

Finance Data Analyst, Wordremit (Zepz), Remote

Jun 2022 - May 2024

Sendwave Journaling Process Architected and implemented a company-wide financial journaling system, used to perform official revenue calculations, reducing the legacy code base by 90% (from 3000 to less than 500 lines), built and implemented using SQL with Airflow, and Python for the business logic. Built within 3 months

After migrating our repository of shared knowledge from Quip to Confluence, we encountered a significant challenge: the loss of numerous documents, including our detailed journaling logic. The only complete source remaining was the GitHub repository containing the legacy codebase, which was complex and convoluted, with numerous edge cases, many of which were no longer relevant.

To address this issue, I undertook the task of understanding the journaling process by analyzing the generated data. Utilizing PySpark, I constructed and grouped journal lines based on all relevant feature combinations, subsequently eliminating those with no effect. This analysis resulted in the classification of journal lines into stable and unstable groups. The unstable groups were attributed to inconsistencies arising from the asynchronous execution of the legacy code.

Collaborating with the Finance team, we developed a refined journaling logic and I authored a design document for the new system, which I then implemented. The resulting system proved significantly more efficient than its predecessor. The codebase was streamlined from over 3,000 lines (excluding constant files) to less than 500, enhancing readability and maintainability. Crucially, processing time was reduced from two to three days to a mere one to two hours, accompanied by a substantial decrease in resource consumption. This efficiency gain was achieved by leveraging OLAP systems and clustering transactions with identical journal characteristics that differed only in amounts.

A crucial aspect of this project involved designing configuration files in spreadsheets and YAML format, granting the Finance team control over variables such as GL accounts and account names. This approach eliminated the need for code changes when modifying these variables and facilitated the specification of diverse fee calculation methods for payout partners. This streamlined approach significantly simplified the integration of new partners into the journaling process.
Refund System Optimization Project At Zepz, an automated refund system was in place to identify transactions requiring reimbursement. However, this system had limitations: it lacked adaptability to new data anomalies and occasionally made errors, refunding balanced transactions with duplicate events.

To address these issues, I initiated a comprehensive improvement project. I implemented a sophisticated clustering algorithm that segmented transactions based on various features. This algorithm maximized as a key metric the overall entropy of the final grouping, in order to minimize the number of features and therefore resulting in a smaller number of groups. By doing so, I reduced the number of groups that needed to be manually reviewed.

I collaborated with the Data Ops team to classify these transaction groups as refundable or non-refundable. Together, we developed a pipeline for classifying new transactions and detecting new groupings based on similarity to existing ones. The system flagged ambiguous cases for manual review, ensuring that only transactions requiring human intervention were sent for further evaluation.

The project yielded significant results. We developed a more accurate and adaptive refund identification process, uncovering previously missed refund obligations. To address these historical liabilities, we engaged a specialized consulting firm to track affected users where automatic refunds were not feasible and manage the process of providing appropriate refunds. In addition, this significantly reduced the customer support contact rate related to refunds by more than 90%. This initiative not only improved the accuracy of our refund system but also demonstrated our commitment to financial integrity and customer satisfaction.
Comparative Studies SW vs WR

Data Scientist, Intact Financial Corporation, Montréal

May 2021 - Apr 2022

Claims cost model Built a pytorch time series model using techniques based on expectation - maximization algorithm that improved the mean absolute error metric of the production model by more than 50%, leading to better underwriting guidelines
Data tools Developed a Featuretools package wrapper for constructing features for pooled time series data, and a DataLoader module adopted by the wider R&D team

Data Engineer, Apple, Cupertino

Jun 2018 - Aug 2018