missing values

Jakson Aquino jaksonaquino at yahoo.com.br
Sun Feb 6 22:46:14 CET 2005


Hello All!

I didn't look at the source code carefully, but from
the behavior of Statist it seems to me that we have
a problem.

Suppose that a research team applied a questionnaire
to 600 people, asking them 50 questions. Some questions
wouldn't apply to everyone, and many people wouldn't
answer many questions. Now I have a database with 50
columns and 600 rows, and many missing values. Every
time I want to compute a correlation between variables,
I will have to:

1) Open with Statist a version of the database that
contains no 'M'. In it, the missing values are coded
as some valid number.

2) Save the columns that I want to correlate in an
ASCII file.

3) Quit Statist

4) Use an external program to recode the new, smaller
database, turning back into 'M' every value that is,
in fact, missing.

5) Load the small recoded database with the option
-delrow and then run the analysis.
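
As an illustration of step 4, here is a minimal sketch
of such an external recoder in C. It is not part of
Statist. The function name recode_line, the choice of
"-999" as the placeholder code in the usage note, and
the assumption that fields are separated by blanks or
tabs are all hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Copy one whitespace-separated data line from `in` to `out`,
 * replacing every field that is textually equal to `code`
 * with "M". Fields are joined by single spaces in `out`.
 * Returns the number of fields replaced, or -1 if the line
 * does not fit in the internal or output buffer. */
int recode_line(const char *in, const char *code, char *out, size_t outsz)
{
    char buf[1024];
    size_t used = 0;
    int replaced = 0;

    if (strlen(in) >= sizeof buf || outsz == 0)
        return -1;
    strcpy(buf, in);            /* strtok modifies its argument */
    out[0] = '\0';

    for (char *tok = strtok(buf, " \t\r\n"); tok != NULL;
         tok = strtok(NULL, " \t\r\n")) {
        const char *field = tok;
        if (strcmp(tok, code) == 0) {
            field = "M";
            replaced++;
        }
        size_t need = strlen(field) + (used > 0 ? 1 : 0);
        if (used + need + 1 > outsz)
            return -1;
        if (used > 0)
            out[used++] = ' ';
        strcpy(out + used, field);
        used += strlen(field);
    }
    return replaced;
}
```

Called on the line "1.5 -999 3" with the code "-999",
it writes "1.5 M 3". The comparison is textual, so
"-999.0" would not match "-999"; a real recoder would
have to agree with whatever code was used in step 1.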

While still becoming familiar with the database, I
would run many analyses, mainly regression analyses
or multiple linear correlations, with different
combinations of variables. It's easy to imagine how
tedious this work would be.

Statist always deletes missing values while reading a
database. The option -delrow only means that the entire
row, instead of just one data point, is deleted.
That's the wrong approach: without -delrow, rows can
be broken apart, and with -delrow, valid values can be
deleted. The data in one row are supposed to be a
single case, and they can't be split across different
rows. Thus, rows with missing values should not be
deleted. They should be ignored during calculations,
when necessary, not while loading a data file. And the
problem can be more than a mere inconvenience; it
becomes a bug if we don't follow the five steps
described above and, unluckily, hit one of the two
circumstances described below:

  1) A database is opened without the option -delrow,
the columns used in a calculation have missing values,
and, by coincidence, they have the same number of
missing values, but not in the same rows. The bug
happens because Statist only tests whether the columns
have the same number of data points before proceeding.
In this case, the correlation for these columns would
proceed and Statist would report a wrong result.

  2) The option -delrow is used, but the analysis
doesn't use all columns, and the columns don't have
their missing values in exactly the same rows.
Consequently, some rows that should be used in the
analysis were deleted while opening the database.
Again, Statist would proceed with the analysis and
report a result computed from fewer cases than it
should.
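
The two circumstances can be reproduced on a toy
example. The C sketch below is not Statist's code; it
only imitates the behavior described above, and every
name in it (drop_missing, count_mispaired,
complete_rows, the is_missing flag arrays) is
hypothetical.

```c
#include <stddef.h>

/* Per-value deletion (loading without -delrow, as described
 * above): keep only the present values of one column, packed
 * together. src_row[k] records which original row out[k] came
 * from. Returns the number of values kept. */
size_t drop_missing(const double *col, const int *is_missing, size_t nrows,
                    double *out, size_t *src_row)
{
    size_t k = 0;
    for (size_t i = 0; i < nrows; i++) {
        if (!is_missing[i]) {
            out[k] = col[i];
            src_row[k] = i;
            k++;
        }
    }
    return k;
}

/* Circumstance 1: count the positions at which two packed
 * columns of equal length pair values from different rows. */
size_t count_mispaired(const size_t *rows_a, const size_t *rows_b, size_t n)
{
    size_t bad = 0;
    for (size_t k = 0; k < n; k++)
        if (rows_a[k] != rows_b[k])
            bad++;
    return bad;
}

/* Circumstance 2: count the rows that have no missing value in
 * the columns listed in `cols`. `is_missing` is a row-major
 * table of nrows x ncols flags. */
size_t complete_rows(const int *is_missing, size_t nrows, size_t ncols,
                     const size_t *cols, size_t nsel)
{
    size_t kept = 0;
    for (size_t i = 0; i < nrows; i++) {
        int ok = 1;
        for (size_t j = 0; j < nsel; j++)
            if (is_missing[i * ncols + cols[j]])
                ok = 0;
        kept += ok;
    }
    return kept;
}
```

Take five rows, with the first column missing its
third value and the second column missing its second
value. After drop_missing both columns hold four
values, so a test on the number of data points passes,
but count_mispaired finds one position where the pair
comes from two different cases. If a third column is
missing its fifth value, complete_rows over all three
columns keeps two rows, while over the two columns
actually analyzed it keeps three: one usable case is
lost by deleting rows at load time.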

We need some way of keeping Statist aware of missing
values, instead of simply deleting them and testing
whether the columns have the same number of data
points. Perhaps we can define the biggest possible
double as the missing value! [I don't know what the
biggest double value is.] That is, whenever Statist
finds 'M' in the database it puts 9.99999e+99 in its
memory, and whenever it finds 9.99999e+99 in its
memory it assumes that it is a missing value. When
writing columns to ASCII files, missing values would
turn back into the usual 'M'. And, just as a
precaution, if it finds 9.99999e+99 while reading an
ASCII file, it will report an error. Of course this
will slow down the computations, since an
if(xx[i] == 9.99999e+99) will be necessary when
reading each single data point from memory. And, worst
of all, it would be necessary to fix all functions!
Any better solution?
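
A minimal sketch of the sentinel idea in C, under
stated assumptions: it uses DBL_MAX from <float.h>,
which is the largest finite double in standard C, in
place of 9.99999e+99, and the names MISSING,
parse_field, format_field and pairwise_cov are
hypothetical, not existing Statist functions. A
covariance stands in for the correlation; the same
test inside the loop would apply there.

```c
#include <float.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed sentinel for a missing value: the largest finite double. */
#define MISSING DBL_MAX

/* Parse one text field into *out. "M" becomes the sentinel.
 * Returns 0 on success, -1 if the text is not a number or if
 * it is a literal number that collides with the sentinel. */
int parse_field(const char *tok, double *out)
{
    char *end;

    if (strcmp(tok, "M") == 0) {
        *out = MISSING;
        return 0;
    }
    *out = strtod(tok, &end);
    if (end == tok || *end != '\0' || *out == MISSING)
        return -1;
    return 0;
}

/* Write a value back as text, turning the sentinel into "M". */
void format_field(double v, char *buf, size_t bufsz)
{
    if (v == MISSING)
        snprintf(buf, bufsz, "M");
    else
        snprintf(buf, bufsz, "%g", v);
}

/* Sample covariance over the rows where both x and y are
 * present. *n_used receives the number of such rows.
 * Returns 0.0 if fewer than two rows are usable. */
double pairwise_cov(const double *x, const double *y, size_t n,
                    size_t *n_used)
{
    double sx = 0.0, sy = 0.0, sxy = 0.0;
    size_t m = 0;

    for (size_t i = 0; i < n; i++) {
        if (x[i] == MISSING || y[i] == MISSING)
            continue;   /* ignore the row, do not delete it */
        sx += x[i];
        sy += y[i];
        sxy += x[i] * y[i];
        m++;
    }
    *n_used = m;
    if (m < 2)
        return 0.0;
    return (sxy - sx * sy / (double)m) / (double)(m - 1);
}
```

With this layout every column keeps all its rows in
memory, so the values of one case stay aligned, and
each function decides for itself which rows to skip
for the columns it actually uses.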

Anyway, we need to write documentation in English
and include this information in it.

Am I wrong, and is there no problem with the way
Statist deals with missing values?

Best regards,

Jakson





