Video Summary and Transcription
In this video, the focus is on browser session analytics and its role in financial fraud detection. The use of tools like PySpark and the CRISP-DM methodology is highlighted, emphasizing their importance in analyzing large datasets. The video explores the use of the Pearson correlation matrix for feature selection and the deployment of models on big data platforms using HDFS and Spark. Techniques such as decision trees, random forest classifiers, and gradient boosting classifiers are discussed. The validation phase uses the area under the ROC curve as the key metric, because fraud detection datasets are heavily imbalanced. The GBT classifier achieved a score of 0.94, identifying a significant portion of fraudulent sessions. Regular retraining and real-time application of these models are crucial for maintaining their effectiveness.
1. Financial Fraud Detection with the CRISP-DM Methodology
Hello, I'm Javier Arcaide, a data scientist at Blue Tab Solutions. We specialize in advanced analytics and big data. We recently worked on improving financial fraud detection for a client in the financial sector. Using Spark and the CRISP-DM methodology, we analyzed the datasets and discovered valuable insights, such as the correlation between fraudulent sessions and the mobile cast page being accessed from the web application. By selecting the best features and cleansing the data, we created more accurate models for detecting fraudulent transactions.
Hello, I'm Javier Arcaide. I work as a data scientist at Blue Tab Solutions, designing and developing machine learning solutions. At Blue Tab, we are experts in advanced analytics and big data, which allows us to help our clients with this kind of project.
Over the last few years, financial fraud has grown dramatically, and the pandemic has made this trend worse. At the beginning of the year, one of our clients in the financial sector asked us to improve the way they detected financial fraud in their online banking applications. To solve this problem, they provided us with a dataset from Adobe Omniture containing around 80 million records of online banking app sessions, each with 45 fields of information, along with a dataset containing the frauds detected by their fraud team in recent months. We tackled the problem on our client's big data platform and, due to the size of the datasets, decided to use Spark for the processing and analysis of the data.
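As a rough illustration of this starting point, the following is a minimal PySpark sketch of how such a dataset might be loaded and labeled. The paths, file format, and column names (such as session_id) are assumptions for illustration, not the client's actual schema.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("fraud-detection").getOrCreate()

# ~80 million clickstream session records exported from Adobe Omniture
# (paths, format, and column names are assumed for this sketch)
sessions = (spark.read
                 .option("header", True)
                 .option("inferSchema", True)
                 .csv("hdfs:///data/omniture/sessions"))

# Sessions flagged as fraudulent by the fraud team
frauds = (spark.read
               .option("header", True)
               .option("inferSchema", True)
               .csv("hdfs:///data/fraud/labels"))

# Label each session: 1 if it appears in the fraud dataset, 0 otherwise
labeled = (sessions.join(frauds.select("session_id").withColumn("label", F.lit(1)),
                         on="session_id", how="left")
                   .fillna(0, subset=["label"]))
```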
Our approach uses a well-known data mining methodology, CRISP-DM. This process divides the solution into five major phases.

The first one is business understanding. The purpose of this phase is to align the objectives of the project with the client's business objectives. We focused on understanding the client's expectations and the project goals. With this knowledge of the problem, we designed a preliminary plan to achieve the objectives.

The second phase is data understanding. We consider this the most important phase of the methodology. Its goal is to get to know the data: its structure, its distribution, and its quality. We started with a univariate analysis of the dataset's columns against the target. Our conclusions from this analysis were crucial in deciding which variables would be included in the training of the model. In this phase we discovered, for example, that in 70% of the fraudulent sessions the mobile cast page was accessed from the web application; that 90% of the sessions opened from one particular device, the UMI Plus, were fraudulent, which covered around 15% of the frauds; and that in around 75% of the fraudulent sessions the operating system used was Windows 8.1. The extraction of these insights is the differential value that a data scientist can offer in the creation of models. Through this acquired knowledge, and by selecting the best features, we were able to create much more accurate models for the detection of fraudulent transactions.

The third phase is data preparation. Once the variables are selected, it is time to prepare the dataset to train the different models. It is typically necessary to cleanse the data, ensuring that null values and outliers are identified. This, combined with mathematical transformations such as exponential or logarithmic functions, can improve the dispersion of the distributions, which helps train the models better. The full set of cleansing steps and transformations resulted in a new dataset with more than 200 features. A short sketch of what these steps might look like follows below.
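The sketch below shows, under assumed column names (operating_system, page_views), how a univariate check against the target and a simple log transform could be expressed in PySpark; it is illustrative only and builds on the `labeled` DataFrame from the previous sketch.

```python
# Univariate view of one candidate feature against the target:
# fraud rate per operating system ("operating_system" is an assumed column name)
fraud_rate_by_os = (labeled.groupBy("operating_system")
                           .agg(F.count("*").alias("sessions"),
                                F.avg("label").alias("fraud_rate"))
                           .orderBy(F.desc("fraud_rate")))
fraud_rate_by_os.show(10)

# Simple cleansing and transformation step: fill nulls and log-transform a
# skewed numeric column ("page_views" is likewise an assumed feature name)
prepared = (labeled.fillna({"page_views": 0})
                   .withColumn("log_page_views", F.log1p(F.col("page_views"))))
```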
2. Modeling, Validation, and Deployment
We used the Pearson correlation matrix to group features and select the best ones for the model. Decision trees, random forest classifiers, and gradient boosting classifiers were used to create the models. The validation phase used the area under the ROC curve as the metric. The deployment phase used the client's big data platform, based on HDFS and Spark. The GBT classifier yielded the best result, with a score of 0.94. The model identified a grouping of sessions covering 10% of total sessions and including 90% of frauds. Working with big data tools like PySpark is essential for accurate models. Regular retraining is necessary, as these models become outdated quickly. The next steps involve running the model in real time so that swift action can be taken when fraud is detected.
We used the Pearson correlation matrix to group the features into correlated families, from which we could choose the best one for the model. The fourth phase is modeling and validation. Once the training dataset was constructed, we used the algorithms contained in the Spark ML library, specifically decision trees, random forest classifiers, and gradient boosting classifiers, to create our models.
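A minimal sketch of these two steps in PySpark, continuing from the `prepared` DataFrame above: the Pearson correlation matrix comes from pyspark.ml.stat.Correlation and the classifiers from pyspark.ml.classification. The feature names and the train/test split are assumptions for illustration.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.ml.classification import (DecisionTreeClassifier,
                                       RandomForestClassifier,
                                       GBTClassifier)

# Illustrative feature names; the real dataset had more than 200 engineered features
feature_cols = ["log_page_views", "session_duration", "clicks"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
vectorized = assembler.transform(prepared)

# Pearson correlation matrix, used to group correlated features into families
corr_matrix = Correlation.corr(vectorized, "features", "pearson").head()[0]

# Train/test split and one of the candidate models (GBT shown here)
train_df, test_df = vectorized.randomSplit([0.8, 0.2], seed=42)
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50)
gbt_model = gbt.fit(train_df)
```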
For the validation, we decided to use the area under the ROC curve as the metric, because the target was not balanced in the dataset, which means that metrics such as accuracy cannot be used. In the deployment phase, the last one, we used our client's big data platform, based on HDFS and Spark, to deploy the model. It runs once a day on the previous day's data, which amounts to around six million records. Since the model is designed and developed using Spark, it can be deployed on any platform, cloud or on-premise, capable of running Spark applications.
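For the validation metric, a hedged sketch using Spark ML's BinaryClassificationEvaluator, assuming gbt_model and test_df are the fitted classifier and held-out split from the previous sketch:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Area under the ROC curve, chosen because the fraud class is heavily imbalanced
evaluator = BinaryClassificationEvaluator(labelCol="label",
                                          rawPredictionCol="rawPrediction",
                                          metricName="areaUnderROC")
auc = evaluator.evaluate(gbt_model.transform(test_df))
print(f"Area under ROC: {auc:.2f}")
```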
After the validation of the model, we found that the GBT classifier yielded the best result, with a score of 0.94 for the area under the curve. The model was able to identify a grouping of sessions covering 10% of the total sessions in which 90% of the frauds were included. This allows analysts to spend more of their time on higher-risk cases. In conclusion, in order to build more accurate models, it is important to use the full population of the data, which would be impossible without big data tools such as PySpark. These strong results are based on the prior study of the variables and the insights obtained during the analysis. On the other hand, this kind of model becomes outdated quite fast, so it is necessary to retrain it regularly, usually every two months. The next step would be to run this model in real time, so the client can act swiftly when fraud is detected, for example by asking for two-factor authentication or blocking the transaction if the model predicts fraudulent activity.
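As one possible shape for that real-time step (not the client's actual pipeline), a Structured Streaming sketch could score sessions arriving on a Kafka topic with the already-trained model. The broker address, topic name, and schema are assumptions, and it reuses feature_cols, assembler, and gbt_model from the earlier sketches.

```python
from pyspark.sql.types import StructType, StructField, DoubleType

# Assumed JSON schema for incoming session events
schema = StructType([StructField(c, DoubleType()) for c in feature_cols])

# Requires the spark-sql-kafka package; broker and topic names are placeholders
stream = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "banking-sessions")
               .load()
               .select(F.from_json(F.col("value").cast("string"), schema).alias("s"))
               .select("s.*"))

# Score each incoming session with the trained GBT model
scored = gbt_model.transform(assembler.transform(stream))

# Emit predicted frauds so downstream systems can trigger 2FA or block the transaction
(scored.filter(F.col("prediction") == 1.0)
       .writeStream.format("console").outputMode("append").start())
```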