
Dark Data: Why What You Don't Know Matters

常华Andy Andy730 2024-03-16

Source: <Dark Data: Why What You Don't Know Matters> by David J. Hand


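I skimmed this book on a flight. The questions the author raises are very interesting.


The subject of the book is not really "dark data" as such.


Rather, it is the problem of measurement: in many real-world scenarios, the data do not reflect the facts.


The author lists 15 types of such scenarios.


Here is a brief summary: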


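  1. Scenarios where we know data are missing. For example, when some people do not answer a survey, the results cannot reflect the full picture.

  2. Scenarios where we do not know data are missing. For example, in a similar case, if the survey is run online, you cannot know who did not fill it in.

  3. Scenarios where only part of the data is selected. For example, researchers keep only the results that are close to what they expected.

  4. Scenarios where the data differ from the phenomenon, or where partial data differ from the data as a whole. For example, in Simpson's paradox, the relationship between two variables within each subgroup can contradict their relationship in the combined data.

  5. Scenarios where the data depend on time. For example, a patient's test results may not reflect the true situation at the moment of the test, or the data may change as time passes.

  6. Scenarios where collecting and processing the data can affect the data themselves. For example, once the data have been processed, say by taking an average, they deviate from the actual situation.

  7. Scenarios related to different understandings of the data. For example, because of information asymmetry, different people draw different insights from the same kind of data. This also covers deliberate data manipulation and fraud.

  8. Scenarios where the data fall outside the boundaries. Values beyond the maximum measurable limit have already been dropped by the time the data are collected and processed.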


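As the list above shows, data can diverge from objective facts at every stage (collection, processing, and interpretation), and this biases the results.


Fundamentally, nature itself contains no data. Data are a human invention for interpreting the world. Data cannot fully reflect the facts; they can only approach them ever more closely.


Seen from this angle, these situations become much easier to understand.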


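To deal with these situations, the questions to think about are:

  1. How can we collect more complete data? That is, as much data as possible, covering both the known unknowns and the unknown unknowns.

  2. How can we collect data more comprehensively? That is, use as many different forms of data collection as possible, including some statistical processing techniques.

  3. How can we avoid, as far as possible, deviations between the collected data and the facts? That is, organize the data in a more systematic way.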


In addition, big data and AI cannot solve these problems perfectly, because the data fed to them may themselves be biased.


The original text follows, for reference:


DD-Type 1: Data We Know Are Missing

These are Rumsfeld's "known unknowns." They arise when we know there are gaps in the data, concealing values which could have been recorded. An example is table values that are missing, as in the marketing data extract in Table 1, or failure of people on a list to respond to an interview, either in part or at all. In the latter case, perhaps all we know about the respondents who refused to take part is their identifying information.


DD-Type 2: Data We Don't Know Are Missing

These are Rumsfeld's "unknown unknowns." We do not even know that we are missing data. An example arises in web surveys, for which we do not have a list of possible respondents, so we do not know who has failed to respond at all. The Challenger space shuttle disaster represented an oversight of this kind, as the teleconference attendees did not recognize they were missing some data.


DD-Type 3: Choosing Just Some Cases

Poor choice of criteria for inclusion in a sample, or poor application of reasonable criteria, can lead to sample distortions. A researcher might choose healthier patients; an investigator might choose people sympathetically inclined to a company being evaluated. A particular variant arises when "the best" are chosen from a large number of cases, since this is likely to lead to disappointment in the future as regression to the mean kicks in. Likewise, p-hacking and failure to allow for multiple hypotheses means scientific results might not be reproducible.
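
The "choose the best, then be disappointed" effect can be sketched with a small simulation. This is only an illustration under stated assumptions: each case has a hidden "skill" plus independent measurement noise, both drawn from a standard normal distribution, and the names `skill`, `first`, and `second` are hypothetical, not taken from the book.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Assumption: observed score = stable skill + independent noise per measurement.
skill = [rng.gauss(0, 1) for _ in range(n)]
first = [s + rng.gauss(0, 1) for s in skill]   # score used for selection
second = [s + rng.gauss(0, 1) for s in skill]  # later score, same cases

# Select "the best" 10% of cases according to the first measurement only.
cutoff = sorted(first, reverse=True)[n // 10 - 1]
best = [i for i in range(n) if first[i] >= cutoff]

mean_first = statistics.mean([first[i] for i in best])
mean_second = statistics.mean([second[i] for i in best])

print(f"selected cases, first measurement:  {mean_first:.2f}")
print(f"selected cases, second measurement: {mean_second:.2f}")
```

In this setup the selected group scores lower on the second measurement than on the first, because part of what made those cases look "best" was favourable noise that does not repeat.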


DD-Type 4: Self-Selection

Self-selection is a variant of DD-Type 3: Choosing Just Some Cases. It arises when people themselves can decide whether to be included in a database. Examples are nonresponse in surveys when the respondents choose whether or not to answer questions, patient databases for which patients can decide whether to have their data stored (opt in and opt out), and more generally in the choice of services that people make (e.g., a bank or supermarket). In all these examples, those included may well differ in some systematic way from those not included.


DD-Type 5: Missing What Matters

Sometimes a critical aspect of a system is entirely unobserved. This can lead to mistaken causal attributions, as when an increase in ice cream sales is followed by grass drying out. Obviously, here the causal network is missing data about the weather—but it's not always so obvious that something is missing. A more troublesome example is Simpson's paradox, in which an overall rate can increase while all constituent rates decrease.
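
A minimal numeric sketch of Simpson's paradox as described here, using made-up counts: the rate falls within each of two hypothetical groups, A and B, yet the overall rate rises because the mix of groups shifts toward the high-rate group. All numbers and names are assumptions chosen for illustration.

```python
from fractions import Fraction

# Hypothetical (events, total) counts per period and group.
counts = {
    "before": {"A": (90, 900), "B": (50, 100)},
    "after": {"A": (8, 100), "B": (405, 900)},
}

def group_rate(period, group):
    """Event rate within one group in one period."""
    events, total = counts[period][group]
    return Fraction(events, total)

def overall_rate(period):
    """Event rate over all groups pooled together in one period."""
    events = sum(e for e, _ in counts[period].values())
    total = sum(t for _, t in counts[period].values())
    return Fraction(events, total)

for group in ("A", "B"):
    before, after = group_rate("before", group), group_rate("after", group)
    print(f"group {group}: {float(before):.3f} -> {float(after):.3f}")

print(f"overall: {float(overall_rate('before')):.3f} -> "
      f"{float(overall_rate('after')):.3f}")
```

The unobserved piece is the group membership: someone who sees only the pooled rate would conclude things got worse, while every group actually improved.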


DD-Type 6: Data Which Might Have Been

Counterfactual data are the data we would have seen had we taken some other action or observed what happens under different conditions or circumstances. An example is a clinical trial in which each patient can receive only one treatment—perhaps because the aim of the trial is to investigate time to cure—so that once a patient has been cured, it is not possible to go back to explore the time the alternative treatment would have taken. Another example is the age of the spouse of someone who is unmarried.


DD-Type 7: Changes with Time

Time can hide data in many ways. For example, data might no longer be an accurate description of the current state of the world, cases might not be observed because they occur after the end of the observation period, cases might drop out because they change their nature, and so on. Examples include medical studies of survival times after diagnosis and when the observation period is terminated before a patient has died, and data describing a country's population 20 years ago, which might be of limited value for developing current public policy.
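
The survival-time example can be illustrated with a toy calculation. The sketch below assumes a hypothetical study that stops observing after 5 years. The survival times are invented, and averaging only the deaths seen so far is shown as the naive approach, not as a recommended analysis.

```python
import statistics

# Hypothetical true survival times in years (unknowable during the study).
true_times = [1, 2, 3, 4, 6, 8, 10, 12]
study_length = 5  # assumption: observation ends after 5 years

# Deaths recorded before the study ends; everyone else is still alive
# at the cutoff, so their survival time is only known to exceed 5 years.
observed_deaths = [t for t in true_times if t <= study_length]
still_alive = [t for t in true_times if t > study_length]

naive_mean = statistics.mean(observed_deaths)  # ignores the survivors
true_mean = statistics.mean(true_times)

print(f"patients still alive at cutoff: {len(still_alive)}")
print(f"naive mean survival (deaths only): {naive_mean}")
print(f"true mean survival: {true_mean}")
```

Dropping the patients who outlive the study makes survival look shorter than it is, which is the kind of time-hidden data the paragraph describes.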


DD-Type 8: Definitions of Data

Definitions might be inconsistent and may change over time to better reflect their purpose and use. This can cause problems with economic (and other kinds of) time series, in which the data underlying them may cease to be collected. More generally, if people define concepts in different ways, they may well draw different conclusions. One example is UK crime statistics, which are measured by police records and by a survey of victims, since the two sources have different definitions of a crime.


DD-Type 9: Summaries of Data

By definition, summarizing data means discarding the details. If you report merely an average, you reveal nothing about the range of the data, and nothing about the skewness of the distribution. An average could conceal the fact that some values are very different indeed. Or, at the other extreme, it could conceal the fact that all the values are identical.
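
As a small illustration of what an average can hide, the sketch below builds three invented data sets that share the same mean but differ completely in spread and skew. The values and the `summarize` helper are hypothetical.

```python
import statistics

# Three made-up data sets, each with a mean of exactly 50.
datasets = {
    "identical": [50, 50, 50, 50, 50],
    "spread": [0, 0, 50, 100, 100],
    "skewed": [10, 10, 10, 10, 210],
}

def summarize(values):
    """Return the mean plus the details a bare mean would hide."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
    }

for name, values in datasets.items():
    print(name, summarize(values))
```

Reporting only the mean would make these three data sets indistinguishable. The median, range, and standard deviation restore some of the discarded detail.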


DD-Type 10: Measurement Error and Uncertainty

Measurement error leads to uncertainty about the underlying true value. This is most easily seen if we imagine a situation in which the range of measurement error is as large as or larger than the range of underlying true values, since then the observed value can be very different from the truth. Rounding, heaping, ceiling, floor effects, and others all inject uncertainty into the data, obscuring precise values. A different cause of uncertainty and inaccuracy is data linkage, in which identifying information might be stored in different styles, making matching error-prone.


DD-Type 11: Feedback and Gaming

This type of data arises when the values of data which have been collected influence the collection process itself—as in grade inflation and share-price bubbles. It means that the data are a distorted representation of the underlying reality—possibly drifting further from it as time progresses.


DD-Type 12: Information Asymmetry

Different data sets may be held by different people, and information asymmetry arises when one knows something that others don't. Examples are insider trading, Akerlof's market for lemons, and international tensions arising from limited knowledge of an enemy country's capabilities.


DD-Type 13: Intentionally Darkened Data

This particular example of choosing just some cases is a particularly troublesome one. It arises when people deliberately conceal or manipulate data with intent to deceive or mislead—it is fraud. We saw that it can arise in many contexts and in many ways. 


DD-Type 14: Fabricated and Synthetic Data

When data are made up it might be with the intention to mislead, as in fraud. But it also occurs in simulation, when artificial data sets which could have arisen from the process being studied are generated, and in other applications in which data are replicated, such as bootstrap, boosting, and smoothing. Modern statistical tools make extensive use of such ideas, but poor replication can result in misleading conclusions.


DD-Type 15: Extrapolating beyond Your Data

Data sets are necessarily always finite. That means they have a maximum and a minimum value, beyond which lies the unknown. Making statements about possible values above the maximum or below the minimum in a data set requires that assumptions must be made, or that information is acquired from some other source. We saw an example of this with the Challenger disaster, where the launch occurred at an ambient temperature below any previously experienced.
