How does Stata treat missing observations

This page describes how to indicate missing data in your raw data files, as well as how missing data are handled in Stata logical commands and assignment statements. We will illustrate some of the missing data properties in Stata using data from a reaction time study with eight subjects, identified by the variable id, whose reaction times were measured at three time points: trial1, trial2 and trial3.

The input data file is shown below. You might notice that some of the reaction times are coded using a single period (.). For example, where the person measuring time did not measure the response time properly on the second trial, the data point for the second trial is missing.
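As a rough sketch of what such an input file could look like, the do-file fragment below enters eight subjects with the input command. The reaction times are invented for illustration and are not the study's data; only the variable names id, trial1, trial2 and trial3 come from the text. A lone period in a numeric column is read as a missing value.

```stata
* Hypothetical data: all reaction times below are made up for illustration.
clear
input id trial1 trial2 trial3
1 1.2 1.3 1.1
2 1.4 .   1.5
3 .   1.8 1.7
4 .   .   2.0
5 1.6 1.9 1.8
6 1.7 1.6 1.5
7 .   .   .
8 .   .   .
end

* Missing values are displayed as a period in the listing.
list, clean
```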

As a general rule, Stata commands that perform computations of any type handle missing data by omitting the row with the missing values.

As you see in the output below, summarize computed means using 4 observations for trial1 and trial2 and 6 observations for trial3.

In short, the summarize command performed the computations on all the available data. A second example shows how the tabulation or tab1 command handles missing data. Like summarize, tab1 uses just available data. Note that the percentages are computed based on the total number of non-missing cases.
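The two commands described above can be tried with a sketch like the following. The data are hypothetical (invented values, chosen so that trial1 and trial2 each have 4 non-missing observations and trial3 has 6); only the variable names come from the text.

```stata
* Hypothetical data: all reaction times below are made up for illustration.
clear
input id trial1 trial2 trial3
1 1.2 1.3 1.1
2 1.4 .   1.5
3 .   1.8 1.7
4 .   .   2.0
5 1.6 1.9 1.8
6 1.7 1.6 1.5
7 .   .   .
8 .   .   .
end

* The Obs column reports the non-missing count used for each variable.
summarize trial1 trial2 trial3

* One-way tables; percentages are based on the non-missing cases only.
tab1 trial1 trial2 trial3
```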

You might want the percentages to be computed out of the total number of observations, with the percentage missing for each variable shown in the table. This can be achieved by adding the missing option, which can be shortened to m, to the tabulation command. A third example is the correlate command. We would expect it to perform the computations based on the available data and omit the missing values.
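The two commands this paragraph refers to might look as follows. This is a sketch on hypothetical data (the values are invented); the variable names are the only part taken from the text.

```stata
* Hypothetical data: all reaction times below are made up for illustration.
clear
input id trial1 trial2 trial3
1 1.2 1.3 1.1
2 1.4 .   1.5
3 .   1.8 1.7
4 .   .   2.0
5 1.6 1.9 1.8
6 1.7 1.6 1.5
7 .   .   .
8 .   .   .
end

* With the missing option, missing values get their own row in each table
* and percentages are computed out of all eight observations.
tab1 trial1 trial2 trial3, missing

* Correlations among the three trials.
correlate trial1 trial2 trial3
```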

In the output, the missing values are excluded: correlate performs listwise deletion and displays correlations only for observations that have non-missing values on all variables listed. Stata also allows for pairwise deletion, in which each correlation is computed from the observations that have non-missing values for that pair of variables. This can be done using the pwcorr command. We use the obs option to display the number of observations used for each pair.
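To compare the two deletion strategies side by side, a sketch such as the one below could be used. The data are hypothetical (invented values). With these particular values, three subjects (ids 1, 5 and 6) are complete on all three trials, while the pairs have three or four usable observations each.

```stata
* Hypothetical data: all reaction times below are made up for illustration.
clear
input id trial1 trial2 trial3
1 1.2 1.3 1.1
2 1.4 .   1.5
3 .   1.8 1.7
4 .   .   2.0
5 1.6 1.9 1.8
6 1.7 1.6 1.5
7 .   .   .
8 .   .   .
end

* Listwise deletion: only subjects with all three trials observed are used.
correlate trial1 trial2 trial3

* Pairwise deletion: each pair uses every subject observed on both variables;
* obs prints the number of observations behind each correlation.
pwcorr trial1 trial2 trial3, obs
```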

As you can see, they differ depending on the amount of missing data.

Missing values are not zeros. This may seem obvious, but I have had many students nonchalantly say, "oh, so we can just replace those with zeros." Consider this in the context of gas mileage: a car whose mileage was not recorded is not a car that gets zero miles per gallon. Different statistical software packages code missing data differently. In Stata, if your variable is numeric and you are missing data, you will see a period (.).

If you are working with string variables, the missing value will appear as a blank (an empty string, ""). Missing data values will affect how Stata handles your data.

These problems can be solved with similar methods. A different situation, not addressed directly in this FAQ, is when values of some time-varying variable are known only for certain observations.

There is then a need for imputation or interpolation between known values. Copying the last value forward is unlikely to be a good method of interpolation unless, as just stated, it is known that values remained constant at a stated level until the next stated level. Either way, users applying the methods described here for imputation or interpolation take on the responsibility for what they do.

Note that all the interpolation methods mentioned here, and some others, are directly implemented in the community-contributed command mipolate, which is downloadable from SSC.

Let us first look at the case where you have not tsset your data (see, for example, [TS] tsset for an explanation), but we will assume that the data have been put in the correct sort order, say, by using the sort command.

There is not, of course, any observation before the first, or after the last, so myvar[0] is always missing, as is myvar[i] for any observation number i that is negative or greater than the number of observations in the data. See [U]. If myvar is numeric, you could write a replace statement that copies the previous value, myvar[_n-1], into each observation where myvar is numeric missing. Most problems involve missing numeric values, so, from now on, examples will be for numeric variables only.
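A minimal sketch of this idea for a numeric variable is given below. The variable time and all the values are hypothetical, introduced only for illustration; myvar is the name used in the text. The missing() function is used as the test here; it is one of several ways of writing the condition and not necessarily the one the FAQ originally showed.

```stata
* Hypothetical data: time and the values of myvar are made up.
clear
input time myvar
1 42
2 .
3 57
4 .
5 61
end

* Put the data in the intended order before using subscripts.
sort time

* References outside the data return missing.
display myvar[0]
display myvar[_N + 1]

* Copy the previous value into each missing observation.
replace myvar = myvar[_n-1] if missing(myvar)

list, clean
```

After the replace, observations 2 and 4 hold 42 and 57 respectively, the values that preceded them.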

However, if myvar were string, the condition would instead test for the empty string ("").

Missing values may occur in blocks of two or more. Suppose you want to replace missings by the previous nonmissing value, whenever it occurred. To get this, it helps to know that replace always uses the current sort order: the value for observation 2 is always replaced before that for observation 3, so the replacement value for 2 may be used in calculating the replacement value for 3. Suppose myvar[1] is 42 and myvar[2] and myvar[3] are both missing. Then myvar[2] is replaced by myvar[1], 42, and myvar[3] is replaced by the new value of myvar[2], 42, not its original value, missing.

In this way, nonmissing values are copied in a cascade down the current sort order. Naturally, one or more missing values at the start of the data cannot be replaced in this way, as no nonmissing value precedes any of them.
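The cascade can be seen in a small sketch like the one below. The variable time and the values are hypothetical; the layout (42 followed by a block of two missing values) mirrors the situation described above.

```stata
* Hypothetical data: a block of two missing values follows 42.
clear
input time myvar
1 42
2 .
3 .
4 57
5 .
end

sort time

* replace works down the current sort order, so observation 3 is filled
* from the already-updated value in observation 2.
replace myvar = myvar[_n-1] if missing(myvar)

list, clean
```

With these values, myvar ends up as 42, 42, 42, 57, 57. Had the first observation been missing, it would have stayed missing, because myvar[0] is itself missing.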


