# Rethinking Data Science Through the Relationship Between Statistics and Machine Learning, and Where It Should Go Next

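Leo Breiman, a renowned professor at the University of California, Berkeley (a creator of CART decision trees, bagging, and random forests), was one of the first to recognize the problems in the classical statistics community. In 2001 he wrote an extremely important paper, "Statistical Modeling: The Two Cultures". The paper sharply criticized classical statistics for confining data to assumed models, and strongly promoted the experience he had gained in commercial consulting, where algorithmic models from machine learning had worked well. The two cultures are: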

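• Data model: the data-generating process is treated as known, or at least as something that can be assumed. Statistical models typically assume a data-generating process and a distribution for the model variables, so they are data models.

• Algorithm model: the data-generating process is assumed to be unknown and complex. Many machine learning and deep learning algorithms are algorithmic models.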
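
The contrast between the two cultures can be sketched with a small simulation. This is only an illustration under stated assumptions, not Breiman's own experiment: the sine-shaped data, the function names (`make_data`, `fit_linear`, `fit_knn`), and the choice of k-nearest-neighbours as the algorithmic model are all hypothetical. Breiman's own examples of algorithmic models are methods such as decision trees and neural nets; k-nearest-neighbours is swapped in here because it fits in a few lines of standard-library Python.

```python
import math
import random

def make_data(n, rng):
    """Simulate y = sin(2*pi*x) + noise. The analyst is assumed not to know this form."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

def fit_linear(xs, ys):
    """Data model: assume y = a + b*x + noise and estimate a, b by least squares."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(xs, ys, k=5):
    """Algorithmic model: no assumed functional form, average the k nearest training points."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict

def mse(model, xs, ys):
    """Mean squared prediction error on held-out data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)  # fixed seed so the run is repeatable
x_train, y_train = make_data(200, rng)
x_test, y_test = make_data(200, rng)

linear_mse = mse(fit_linear(x_train, y_train), x_test, y_test)
knn_mse = mse(fit_knn(x_train, y_train), x_test, y_test)
print(f"linear (data model) test MSE:      {linear_mse:.3f}")
print(f"kNN (algorithmic model) test MSE:  {knn_mse:.3f}")
```

The straight line loses here only because its assumed form is wrong for the simulated data. A data model with the correct functional form would predict well, which is why the disagreement between the two cultures is about how often such assumptions can be trusted.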

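• At the 2018 annual meeting, the head of the Department of Statistics at the University of Hong Kong called on the department's faculty to have the courage to embrace AI.

• Professor Bin Yu, a statistician and member of two U.S. academies, gave a talk at Peking University last year in which she criticized the faculty of its statistics department for caring only about the four top journals and making their own circle ever smaller. She urged that statistics in the new era should include machine learning.

• Jianqing Fan, a professor of statistics at Princeton, posted his first survey of deep learning on arXiv just this year.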

• Explainability.

• Lack of understanding of cause-effect relationships, that is, no capacity for causal inference.

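Pearl argues that getting out of the current predicament requires machines to learn causal inference. Concretely, this means answering the following question: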

How can machines represent causal knowledge in a way that would enable them to access the necessary information swiftly, answer questions correctly, and do it with ease


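Hernan(2019) argues that we now need to redefine data science and place causal inference at its core. The tasks of data science fall into three classes: description, prediction, and counterfactual prediction. Specifically: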

• Description is using data to provide a quantitative summary of certain features of the world.

• Prediction is using data to map some features of the world to other features of the world.

• Counterfactual prediction is using data to predict certain features of the world as if the world had been different, which is required in causal inference applications.

What if I had acted differently?
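
The three tasks can be told apart on a small simulated data set. The sketch below is an illustration only and is not taken from Hernan(2019): the variables `z` (a binary confounder such as disease severity), `a` (treatment), and `y` (outcome), the coefficients, and the helper `standardized_mean` are all hypothetical. The counterfactual step uses standardization over `z`, which is valid here only because the simulation makes `z` the sole confounder by construction. In real data that would be a piece of causal domain knowledge, not something the data can confirm.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
rows = []
for _ in range(20000):
    z = 1 if rng.random() < 0.5 else 0                  # hypothetical confounder
    a = 1 if rng.random() < (0.8 if z else 0.2) else 0  # treatment is more likely when z = 1
    y = 1.0 * a - 2.0 * z + rng.gauss(0, 1)             # true causal effect of a is +1.0
    rows.append((z, a, y))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Description: a quantitative summary of a feature of the data.
treated_share = mean(a for _, a, _ in rows)

# Prediction: map one feature (treatment status) to another (outcome).
# In the population this difference is (1 - 2*0.8) - (-2*0.2) = -0.2,
# so treated people are predicted to do slightly worse.
pred_treated = mean(y for _, a, y in rows if a == 1)
pred_untreated = mean(y for _, a, y in rows if a == 0)
assoc_diff = pred_treated - pred_untreated

def standardized_mean(a_value):
    """Average outcome if everyone had received a_value, standardized over z."""
    total = 0.0
    for z_value in (0, 1):
        p_z = mean(1 if z == z_value else 0 for z, _, _ in rows)
        total += p_z * mean(y for z, a, y in rows if z == z_value and a == a_value)
    return total

# Counterfactual prediction: what would the outcomes be had the world been different?
causal_diff = standardized_mean(1) - standardized_mean(0)

print(f"description: share treated          = {treated_share:.2f}")
print(f"prediction: treated minus untreated = {assoc_diff:.2f}")
print(f"counterfactual: effect of treatment = {causal_diff:.2f}")
```

The predictive comparison and the counterfactual one differ in sign on the same data. The first is a correct answer to "who does worse?", the second to "what would happen if we treated?", and only the second needs the assumption about `z`.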

The final conclusion of Hernan(2019) is:

Data science is a component of many sciences, including the health and social ones. Therefore, the tasks of data science are the tasks of those sciences—description, prediction, causal inference. A sometimes-overlooked point is that a successful data science requires not only good data and algorithms, but also domain knowledge (including causal knowledge) from its parent sciences.

The current rebirth of data science is an opportunity to rethink data analysis free of the historical constraints imposed by traditional statistics, which have left scientists ill-equipped to handle causal questions. While the clout of statistics in scientific training and publishing impeded the introduction of a unified formal framework for causal inference in data analysis, the coining of the term “data science” and the recent influx of “data scientists” interested in causal analyses provides a once-in-a-generation chance of integrating all scientific questions, including causal ones, in a principled data analysis framework. An integrated data science curriculum can present a coherent conceptual framework that fosters understanding and collaboration between data analysts and domain experts.


• Breiman(2001) Statistical Modeling: The Two Cultures

• Fan(2019) A Selective Overview of Deep Learning https://arxiv.org/abs/1904.05526

• Pearl(2019) The Seven Tools of Causal Inference, with Reflections on Machine Learning

• Hernan(2019) A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks