# Statistics for Data Science Topics

After understanding the important topics of mathematics, we will now look at some of the important concepts of statistics for data science. First, every data scientist needs to know some statistics and probability theory. Statistics is a mathematical science pertaining to the collection, analysis, interpretation, and presentation of data, and it offers various methods that help solve complex real-world problems. It can be a powerful tool when performing the art of data science (DS): data scientists and data analysts use it to look for meaningful trends in the world.

If you start data science directly with Python, R, and similar tools, you will deal with a lot of technology but not with the statistics underneath.

In a nutshell, frequentists use probability only to model sampling processes.

In both the left and right side of the image above, our blue class has far more samples than the orange class. Over- and undersampling can combat that kind of imbalance.

The median is often used over the mean since it is more robust to outlier values.

So, if you want to know which topics fall under data science, here is a list that covers them.

Bio: Sergey Feldman is a machine learning and data science consultant.
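To make the point about the median's robustness concrete, here is a minimal sketch using Python's standard `statistics` module. The salary figures and the `summarize` helper are made up for illustration; they do not come from any real dataset.

```python
from statistics import mean, median

# Hypothetical salaries (in thousands); values are invented for illustration.
salaries = [40, 42, 45, 47, 50]
# The same data with one extreme value appended.
salaries_with_outlier = salaries + [1000]


def summarize(values):
    """Return (mean, median) for a list of numbers."""
    return mean(values), median(values)


clean_mean, clean_median = summarize(salaries)
outlier_mean, outlier_median = summarize(salaries_with_outlier)

print(f"without outlier: mean={clean_mean}, median={clean_median}")
print(f"with outlier:    mean={outlier_mean}, median={outlier_median}")
```

A single extreme value pulls the mean far away from the bulk of the data, while the median barely moves.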
We can define probability as the percent chance that some event will occur. For a non-expert, what's the difference between Bayesian and frequentist approaches? Frequentists only assign probabilities to describe data they've already collected. In the Bayesian approach, we also take into account our evidence that the die is loaded, and whether that evidence is true or not, based on both its own prior and the frequency analysis.

In oversampling, copies of minority-class samples are made such that the distribution of the minority class is maintained.

Statistics is one of the most crucial subjects for students. It has two main categories, the first of which is descriptive statistics. Wasserman is a professor of statistics and data science at Carnegie Mellon University.

Statistical features are probably the most used statistics concept in data science. They are often the first stats technique you would apply when exploring a dataset and include things like bias, variance, mean, median, percentiles, and many others. The min and max values represent the lower and upper ends of our data range. A box plot perfectly illustrates what we can do with basic statistical features: all of that information comes from a few simple statistical features that are easy to calculate!
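The statistical features behind a box plot can be computed in a few lines. The sketch below uses the standard library's `statistics.quantiles`; the `scores` data and the `five_number_summary` helper are hypothetical, and the 1.5 × IQR fence for flagging outliers is a common plotting convention rather than something the text prescribes.

```python
from statistics import quantiles

# Hypothetical scores, invented for illustration.
scores = [12, 15, 18, 20, 22, 25, 28, 30, 95]


def five_number_summary(values):
    """Return min, first quartile, median, third quartile, and max."""
    # method="inclusive" treats the data as the whole population,
    # so the quartiles stay within the observed min and max.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(values),
    }


summary = five_number_summary(scores)

# Conventional box-plot fences: points beyond 1.5 * IQR are drawn as outliers.
iqr = summary["q3"] - summary["q1"]
lower_fence = summary["q1"] - 1.5 * iqr
upper_fence = summary["q3"] + 1.5 * iqr
outliers = [v for v in scores if v < lower_fence or v > upper_fence]

print(summary)
print("outliers:", outliers)
```

A short box (small IQR) suggests the data points are similar, while a max far beyond the upper fence points to an outlier worth inspecting.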
The mathematical science known as statistics is what helps us deal with all of this information. Using statistics, we can gain deeper and more fine-grained insights into how exactly our data is structured and, based on that structure, how we can optimally apply other data science techniques to get even more information. The book seeks to quickly bring computer science students up to speed with probability and statistics.

One of the philosophical debates in statistics is between Bayesians and frequentists. The Bayesian side is more relevant when learning statistics for data science. For example, if you wanted to roll the die 10,000 times and the first 1000 rolls all came up 6, you'd start to get pretty confident that the die is loaded!

The third quartile is the 75th percentile; i.e., 75% of the points in the data fall below that value. If we see a Gaussian distribution, we know that many algorithms perform well by default on Gaussian data specifically, so we should go for those. The test has a mean score of 150 and a standard deviation of 20.

Over- and undersampling are techniques used for classification problems. In this case, we have 2 pre-processing options which can help in the training of our machine learning models. It's all fairly easy to understand and implement in code!

Sergey Feldman is the founder of Data Cowboys and lives in Seattle.

With dimensionality reduction we would then project the 3D data onto a 2D plane. Another way we can do dimensionality reduction is through feature pruning. For example, after exploring a dataset we may find that out of the 10 features, 7 of them have a high correlation with the output but the other 3 have very low correlation.
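Feature pruning by correlation can be sketched in plain Python as follows. Everything here is an assumption made for illustration: the synthetic features, the `pearson` and `prune_features` helpers, and the 0.3 cutoff, which is an arbitrary threshold and not a recommended value. The example uses 3 features instead of 10 to stay short.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def prune_features(features, target, threshold=0.3):
    """Keep only features whose absolute correlation with the target
    reaches the (arbitrarily chosen) threshold."""
    return {
        name: column
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    }


# Synthetic data: two features drive the target, one is pure noise.
rng = random.Random(0)
n_samples = 500
features = {
    "signal_a": [rng.gauss(0, 1) for _ in range(n_samples)],
    "signal_b": [rng.gauss(0, 1) for _ in range(n_samples)],
    "noise": [rng.gauss(0, 1) for _ in range(n_samples)],
}
target = [
    a - b + 0.1 * rng.gauss(0, 1)
    for a, b in zip(features["signal_a"], features["signal_b"])
]

kept = prune_features(features, target)
print("kept features:", sorted(kept))
```

Correlation only captures linear relationships, so a feature with low correlation may still carry useful nonlinear signal; treat this as a first-pass filter.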
If 3 of a dataset's 10 features have very low correlation with the output, those 3 probably aren't worth the compute, and we might be able to remove them from our analysis without hurting the output. PCA can be used to do both of the dimensionality reduction styles discussed above.

Check out the graphic below for an illustration. The cube represents our dataset and it has 3 dimensions with a total of 1000 points. With today's computing, 1000 points is easy to process, but at a larger scale we would run into problems.

With a Poisson distribution, we'll see that we have to take special care and choose an algorithm that is robust to the variations in the spatial spread. Try these statistical features out whenever you need a quick yet informative view of your data.

Asked for the chance of rolling a 6 on a die, most people would just say that it's 1 in 6. Bayesian statistics does take the evidence into account. The P(E|H) term in Bayes' theorem is called the likelihood and is essentially the probability that our evidence is correct, given the information from our frequency analysis. Use Bayesian statistics whenever you feel that your prior data will not be a good representation of your future data and results.

Statistics is the study of the collection, analysis, visualization, and interpretation of data. I recommend starting with statistics using simple Excel and later applying the same ideas using Python and R. Below are the topics covered in this course. The book is ambitious.

For example, we have 2000 examples for class 1, but only 200 for class 2. That'll throw off a lot of the machine learning techniques we try to use to model the data and make predictions!
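The 2000-versus-200 imbalance above can be balanced in either direction. This is a minimal sketch using only the standard library; the `undersample` and `oversample` helpers and the placeholder sample tuples are hypothetical, and real projects often rely on a dedicated resampling library instead.

```python
import random


def undersample(majority, minority, rng):
    """Randomly keep only as many majority samples as there are minority samples."""
    return rng.sample(majority, len(minority)), list(minority)


def oversample(majority, minority, rng):
    """Randomly duplicate minority samples (with replacement) until the
    classes match in size. Uniform sampling preserves the minority
    class distribution on average."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority), list(minority) + extra


rng = random.Random(42)

# Placeholder samples standing in for real feature rows.
class_1 = [("class_1", i) for i in range(2000)]
class_2 = [("class_2", i) for i in range(200)]

under_1, under_2 = undersample(class_1, class_2, rng)
over_1, over_2 = oversample(class_1, class_2, rng)

print(len(under_1), len(under_2))  # 200 200
print(len(over_1), len(over_2))    # 2000 2000
```

Undersampling throws information away, while oversampling repeats existing points and can encourage overfitting, so which one to use depends on how much data you can afford to lose.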
Data science is like a powerful sports car that runs on statistics.

Projecting the 1000-point cube onto a 2D plane effectively reduces the number of points we need to compute on to 100, a big computational saving!

Frequency statistics involves applying math to analyze the probability of some event occurring, where specifically the only data we compute on is prior data. In data science, probability is commonly quantified in the range of 0 to 1, where 0 means we are certain the event will not occur and 1 means we are certain it will occur.
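As a rough illustration of frequency analysis, the sketch below estimates the probability of each face of a die purely from observed rolls. The simulated rolls, the fixed seed, and the `estimate_probabilities` helper are assumptions for the example; the "loaded" run of 1000 sixes is constructed by hand.

```python
import random
from collections import Counter


def estimate_probabilities(rolls):
    """Estimate each face's probability as its relative frequency in the rolls."""
    counts = Counter(rolls)
    total = len(rolls)
    return {face: counts[face] / total for face in range(1, 7)}


# Simulate 10,000 rolls of a fair six-sided die (seeded for repeatability).
rng = random.Random(0)
fair_rolls = [rng.randint(1, 6) for _ in range(10_000)]
fair_probs = estimate_probabilities(fair_rolls)

# A hand-built run in which every one of 1000 rolls came up 6.
loaded_rolls = [6] * 1000
loaded_probs = estimate_probabilities(loaded_rolls)

print("fair die, P(6)   ~", round(fair_probs[6], 3))
print("loaded run, P(6) =", loaded_probs[6])
```

Every estimate falls between 0 and 1 and the six estimates sum to 1. The purely frequentist estimate from the loaded run says a 6 is certain, which is exactly the situation where prior knowledge about dice becomes useful.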