Generating Betting Odds for Fantasy IPL using Monte-Carlo Simulation and Neural Networks

Summary

  • The project was developed for Mastermind Sports, UK, while I was working as an independent analytical consultant, to support live fantasy cricket betting on their platform.

  • Modeled the odds of a given team winning against the opposition team. The odds are further adjusted and used to calculate the bid price and winning price for users. The odds and the bid price are displayed to end users on the app.

  • Developed a ball-by-ball prediction model using Neural Networks in TensorFlow for simulating an IPL match.

  • Used more than 10 variables describing the state of the game at any given moment as features for the model.

  • Used Monte Carlo Simulation, leveraging the ball-by-ball prediction model, to get the distribution of more than 12 game and player statistics: list of statistics (like …..).

  • The distributions of runs made and wickets taken from the Monte-Carlo simulation were compared with the original distributions to validate the output of the ball prediction model and Monte-Carlo simulation.

  • The distributions were then used to generate odds of winning for specific bets.

  • The generated odds were further verified by bookmaker software and used to decide the winning amount of any bet.

  • The winning amount of a bet and the corresponding bid for the user are displayed on the app.

Data Sources

For ball prediction model:

  • Data for 100+ T20 matches from 4 tournaments was scraped from the internet. The dataset contained details of every ball played in each game.

For modelling the odds of team A winning a match against team B:

  • matches.csv -> Contains the details of all matches played between 2008 and 2019.

    • For 2008-2018, data downloaded from Kaggle

    • For the 2019 season, data scraped from cricinfo.com

  • deliveries.csv -> Contains ball-by-ball data for the matches played in all of the IPL seasons

    • For 2008-2018, data downloaded from Kaggle

    • For 2019, data was collected and cleaned by a participant at a PyData meetup

  • Wikipedia -> Multiple teams had different home grounds in different seasons

    • It was observed that the ‘Home Ground effect’ still applied at a team's original home ground even after the team's home ground was shifted to another city.

    • The original/first home ground is assumed to be home ground for all seasons.

Deep dive

The aim of the first part of the project was to simulate a reasonable two-innings T-20 game, given the two IPL teams playing against each other, the players in both teams, and their history in IPL seasons so far.

In the second part of the project, I was expected to provide the odds of certain game events happening. I calculated the odds by using Monte-Carlo Simulation to simulate the game hundreds of times and then using the resulting distribution of each event.
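A minimal sketch of this idea, under stated assumptions: `simulate_match` is a hypothetical stand-in for the full match simulator (here it simply draws a winner with a fixed, made-up probability), and the function names are illustrative rather than taken from the project code. The estimated probability of an event is the fraction of simulated games in which it occurs.

```python
import random


def simulate_match(rng, p_team_a=0.55):
    """Hypothetical stand-in for the real simulator: returns the winning team."""
    return "A" if rng.random() < p_team_a else "B"


def estimate_win_probability(n_sims=1000, seed=0):
    """Monte-Carlo estimate: share of simulated matches won by team A."""
    rng = random.Random(seed)
    wins = sum(simulate_match(rng) == "A" for _ in range(n_sims))
    return wins / n_sims


# Estimated probability that team A wins, based on 1000 simulated matches.
prob_a = estimate_win_probability()
```

The same counting approach applies to any other event, for example the share of simulated games in which a total score falls within a given range.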

 

The simulation of the game is exposed to the user in real time, and the user bets on certain markets based on it, such as which team would win, how many wickets a specific bowler would take, or how many runs a specific batsman would make. The user is also shown the odds of winning a certain bet, based on which the betting price is set. The odds determine how large a winning amount the user gets upon winning a particular bet.

If a user is betting on an event that is not very likely to happen, i.e. when the probability of the event is low, then the winning amount displayed to the user is larger. In this way, betting on unlikely events fetches users bigger prize money, while betting on events that are very likely to happen brings users very small amounts.
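The inverse relationship between probability and winning amount can be illustrated with fair decimal odds. This is a sketch under assumptions: it uses the standard conversion, decimal odds = 1 / probability, and ignores any bookmaker margin and the further adjustments applied in the project. The function names and the stake are hypothetical.

```python
def fair_decimal_odds(probability):
    """Fair decimal odds for an event (assumption: no bookmaker margin)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / probability


def payout(stake, probability):
    """Total amount returned to the user if the bet wins."""
    return stake * fair_decimal_odds(probability)


# The same stake returns far more on an unlikely event than on a likely one.
unlikely_payout = payout(10.0, 0.05)
likely_payout = payout(10.0, 0.80)
```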

Following are the events of the game on which users could place their bets - Team A winning against Team B; range of the total score of the team at the end of the innings (range, over, and under); powerplay winner; highest wicket-taker; bowler conceding the most runs; highest run scorer.

Developing Ball Prediction Model

The ball prediction model is expected to predict the outcome of a ball, given several input variables such as the players involved, their historical performance, the innings being played, runs to chase, wickets available, current score, overs remaining, etc.

About dataset

All the data used to train the model was downloaded from https://cricsheet.org/

The dataset consisted of T20 matches from the following men's tournaments - Big Bash League, Caribbean Premier League, Indian Premier League, and T20 Blast. Only men's tournaments were included because the resulting simulation exposed to users was only meant to cover men's tournaments. Adding women's tournaments would therefore have introduced bias into the dataset.

The dataset is in the form of JSON files. Each JSON file has some meta info about the match like the teams playing, the outcome of the match, players in each team, toss information, and venue. Further, the file has information about every ball played, including the current over, batter, bowler, non-striker, runs, extras, wicket, etc. Below is a sample JSON file for a random match:

Data Preprocessing

The data in the JSON files could not be loaded and fed to the model as input directly. Several preprocessing steps were required to extract and create the exact features required by the model.

The exact preprocessing steps required are implemented in utils/data_processing.py in the repo. Following is the list of steps - 

  1. Handling of Extras - The input files had different elements for denoting different extras, which led to the input dataframe having multiple columns, one for each of 'extras.noballs', 'extras.wides', 'extras.legbyes', and 'extras.byes'. The wides and no balls were merged and a new column extras_flag was created. Legbyes and byes were not treated as extras, to limit the complexity of the model.

  2. Creation of a score column which is the cumulative sum of the total runs scored in every ball.

  3. Creation of wicket_flag to denote the loss of a wicket for batting team on that ball and a wickets_lost column denoting the total wickets lost till that point of time.

  4. Reading the player stats file (more on this later) for bowlers and batsmen, converting the player stats into probabilities, and attaching the probabilities to the dataframe as features.

  5. Adding the final class variable for that ball, along with some other data cleaning. The class variable represents which class the outcome of a ball falls in. I decided to divide all the possible outcomes of a ball into 6 classes.
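Steps 1 to 3 of the list above can be sketched with pandas. This is a minimal illustration, not the code in utils/data_processing.py. The extras column names, extras_flag, score, wicket_flag, and wickets_lost come from the description above. The 'runs.total' and 'wicket' input columns, the function name, and the sample rows are assumptions made for the example, and the dataframe is assumed to hold the balls of a single innings in order.

```python
import pandas as pd

NAN = float("nan")


def add_basic_features(balls: pd.DataFrame) -> pd.DataFrame:
    """Add extras_flag, score, wicket_flag and wickets_lost to one innings."""
    df = balls.copy()
    # Step 1: merge wides and no-balls into a single flag.
    # Leg byes and byes are deliberately not flagged as extras.
    extras = df[["extras.wides", "extras.noballs"]].fillna(0)
    df["extras_flag"] = (extras.sum(axis=1) > 0).astype(int)
    # Step 2: running score after each ball.
    df["score"] = df["runs.total"].cumsum()
    # Step 3: wicket on this ball, and wickets lost so far.
    df["wicket_flag"] = df["wicket"].notna().astype(int)
    df["wickets_lost"] = df["wicket_flag"].cumsum()
    return df


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "runs.total": [1, 0, 4, 1, 0],
    "extras.wides": [NAN, NAN, NAN, 1.0, NAN],
    "extras.noballs": [NAN, NAN, NAN, NAN, NAN],
    "wicket": [None, "bowled", None, None, None],
})
features = add_basic_features(sample)
```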
     

Player Stats

Different players react differently in the same situation. Aggressive players tend to attack, while players with a low strike rate play at their own pace for the majority of the overs in the match. A player's style matters a lot in deciding the outcome of a ball.

For this reason, I decided to include historical player statistics as input features to the model.

I could not find the statistics I wanted for all the players in one place to scrape, so I used the ball-by-ball match data I had to create my own database of statistics. This would not exactly match a global database calculated from all the matches a player has played in his career, but it would be a close match as far as playing style is concerned.

My script read all the individual match files and collected several data points for bowlers and batsmen to produce their performance stats in a dictionary in the following format - {'player_name': [0,1,2,3,4,6,Wicket]}

For batsmen, each entry is the number of balls on which they scored that many runs, and the last entry is the number of times they got out. For bowlers, each entry is the number of balls on which they conceded that many runs, and the last entry is the number of wickets taken. Extras and other ball outcomes were ignored to simplify the model. These counts were saved to a file and later converted to probabilities before being used as input features to the model.
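The conversion from counts to probabilities can be sketched as a simple normalisation. This is an illustration under assumptions: the function name is hypothetical, the counts are made up, and the uniform fallback for a player with no recorded balls is my own choice for the example rather than something taken from the project.

```python
def counts_to_probabilities(stats):
    """Normalise each player's outcome counts so that they sum to 1."""
    probabilities = {}
    for player, counts in stats.items():
        total = sum(counts)
        if total == 0:
            # Assumption: no recorded balls falls back to a uniform distribution.
            probabilities[player] = [1.0 / len(counts)] * len(counts)
        else:
            probabilities[player] = [count / total for count in counts]
    return probabilities


# Made-up counts in the order [0, 1, 2, 3, 4, 6, Wicket].
batting_counts = {"player_name": [40, 35, 10, 1, 10, 4, 0]}
batting_probs = counts_to_probabilities(batting_counts)
```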

Model Development

Simulating an entire T-20 match using ball prediction model

The ball prediction model takes several state variables from the match and predicts the outcome of the next ball. This outcome is used to update the state variables of the match, and the new state is then used to predict the outcome of the following ball. This is repeated for the entire 20-over innings or until the batting team has lost all its wickets.
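The loop described above can be sketched as follows. This is a simplified illustration, not the project's simulator: `predict_ball` is a hypothetical stand-in for the trained network and returns fixed, made-up probabilities, the outcome labels are illustrative rather than the project's actual classes, and extras are ignored so every ball counts as a legal delivery.

```python
import random

# Illustrative outcome labels (assumption, not the project's class list).
OUTCOMES = ["0", "1", "2", "4", "6", "wicket"]


def predict_ball(state):
    """Hypothetical stand-in for the trained model: class probabilities."""
    return [0.35, 0.35, 0.08, 0.12, 0.05, 0.05]


def simulate_innings(rng, max_overs=20, max_wickets=10):
    """Predict a ball, update the match state, and repeat until the innings ends."""
    state = {"score": 0, "wickets_lost": 0, "balls_bowled": 0}
    while (state["balls_bowled"] < max_overs * 6
           and state["wickets_lost"] < max_wickets):
        probs = predict_ball(state)
        # Sample one outcome according to the predicted probabilities.
        outcome = rng.choices(OUTCOMES, weights=probs, k=1)[0]
        if outcome == "wicket":
            state["wickets_lost"] += 1
        else:
            state["score"] += int(outcome)
        state["balls_bowled"] += 1
    return state


final_state = simulate_innings(random.Random(42))
```

Sampling from the predicted probabilities, rather than always taking the most likely class, is what makes repeated runs differ from each other and therefore usable in a Monte-Carlo simulation.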

Setting up Monte-Carlo Simulation

Using distributions from Monte-Carlo Simulation to Calculate Odds

Modeling odds of winning of a team

References and Related reads -