SDA 3.5 Documentation for CORRTAB

NAME

corrtab - crosstabular breakdown of correlations

USAGE

corrtab -b batchfile

DESCRIPTION

CORRTAB displays the correlations between two variables (the X-variable and the Y-variable) in a crosstabular format. Ordinarily this program is invoked by the Web interface for the SDA programs, and the user does not have to deal with the keywords given in this document. Output from the program is in HTML, which can be viewed with a Web browser.

It is also possible to run the program directly by preparing a command file, which specifies the variables to be analyzed and the options to use. This document explains how to prepare such a file. The name of this batch command file is specified to the program after the ‘-b’ option flag.

KEYWORDS

The batch file contains specifications for the analysis. These specifications are given in the form "keyword = something" with one keyword per line. Keywords may be given in any order, either in upper or in lower case. The valid keywords are as follows (with significant characters shown in capital letters):


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


STUdy=        path of dataset directory       Look for variables in
                                                current directory only

XVAR=         name(s) of 1st variable         REQUIRED
               (separated by spaces/commas)
YVAR=         name(s) of 2nd variable         REQUIRED

ROWvar=       variable name(s)                REQUIRED
               (separated by spaces/commas)
COLUMNvar=    variable name(s)                No column variable

CONtrolvar=   variable name(s)                No control variable

Weight=       name of weight variable         No weighting

Filter=       name(s) and codes of filter     No filter
                variable(s)


COLORcoding=  Yes                             No color coding of
                                                coefficients or headings

GVARCase=     LOWER or UPPER                  No force to lower/upper case

LAnguagefile= Name of file with non-English   English labels on
                labels and messages             output

RUNtitle=     Title or comments for run       No title or comments
                (1 line only)

SAvefile=     filename to receive output      Output sent to screen
                (overwrite existing file)       (standard output)

TExt=         Yes                             No text for variables

Statistic to display

The main statistic to display in each cell of the table can be one of two options: the Pearson correlation coefficient, or the log of the odds ratio. The default main statistic is the Pearson correlation coefficient.

Instead of displaying the main statistic directly, it is possible to display the DIFFERENCE between it and a reference value, such as the overall correlation, by adding the ‘difference=’ keyword.

For each statistic the user can specify the number of desired decimal places (in parentheses, after the name of the statistic). See below for the default number of decimals for each statistic.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


MAINstat=     CORR (ndec)                     Display correlations,
              LOGodds (ndec)                    with default number
                                                of decimal places

DIFference=   Overall (ndec)                  Display main statistic
              (diff from overall correlation)
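
For instance, a batch file along the following lines would display the log of the odds ratio with 3 decimal places instead of the default correlations. This is only a sketch; the dataset path and variable names are borrowed from the examples at the end of this document.

     study = /sa/nes84
     xvar = spend
     yvar = spend2
     row = education
     mainstat = logodds(3)

To display differences from the overall correlation instead, the ‘mainstat=’ line would be replaced by a line such as ‘difference = overall(3)’.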

Optional statistics

In addition to the main statistic, one or more of the following optional statistics can be displayed in each cell (with the desired number of decimal places in parentheses if the defaults, listed below, are not satisfactory). Note that the ‘statistics=’ keyword can be repeated on subsequent lines if necessary.


Keyword       Possible Specification          Default (if no keyword)
_____________________________________________________________________


STAtistics=
              SE (ndec)                       No standard errors
              TSTATistic (ndec)               No t-statistic in cells


              Ncases                          No unweighted N’s
              WNcases (ndec)                  No weighted N’s
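
As a sketch, the following lines request the standard error (with 4 decimal places instead of the default 3), the t-statistic, and the unweighted N in each cell. The ‘statistics=’ keyword is repeated on separate lines, as permitted above; the variable names are taken from the examples at the end of this document.

     xvar = spend
     yvar = spend2
     row = education
     statistics = se(4)
     statistics = tstat
     statistics = ncases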

DICHOTOMIZING VARIABLES

The calculation of the odds ratio assumes that the two variables have only two categories each. If the log of the odds ratio is requested, CORRTAB treats the X-variable and the Y-variable as dichotomies, regardless of the number of categories they may actually have. The minimum valid value of each variable is treated as the base category (coded 0), and all valid values greater than the minimum are combined into the other category (coded 1). If this default dichotomization is not appropriate for a particular variable, you can specify another temporary recode after the variable name.
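
For illustration, suppose a hypothetical variable has valid codes 1 through 4. Under the default rule, code 1 becomes category 0, and codes 2, 3, and 4 are combined into category 1. Once both variables are dichotomized, the log of the odds ratio follows the standard definition for a 2x2 table. In the sketch below, n_{ij} denotes the number of cases with X = i and Y = j; this notation is introduced here for illustration, and the natural logarithm is assumed, since this document does not state the base.

     \ln(OR) = \ln \frac{n_{11} \, n_{00}}{n_{10} \, n_{01}}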

CALCULATION OF STANDARD ERRORS

If standard errors are requested, they are computed with the standard formulas for each statistic or its transformation. Note that the confidence interval for the Pearson correlation coefficient is not symmetric; therefore, there is no single standard error that applies in both directions. CORRTAB outputs the average distance of the upward and the downward confidence band for one standard error (based on the retransformation of Fisher’s Z), since that number is ordinarily a useful approximation. However, if cell sizes are small or the correlations of interest are close to zero or one, this average may not be good enough to make statistical inferences. In such a case (or when in doubt) use Fisher’s transformation and its associated standard error to carry out statistical tests on the corresponding Pearson correlations.
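
The averaging described above can be sketched as follows, assuming the usual large-sample standard error of Fisher’s Z, 1/sqrt(n - 3). That formula is an assumption; this document does not give the exact expression CORRTAB uses. Here r is the correlation in a cell and n is the number of cases.

     z = \tanh^{-1}(r), \qquad SE_z = \frac{1}{\sqrt{n - 3}}
     r_{up} = \tanh(z + SE_z), \qquad r_{down} = \tanh(z - SE_z)
     SE_r \approx \tfrac{1}{2} \left[ (r_{up} - r) + (r - r_{down}) \right] = \tfrac{1}{2} (r_{up} - r_{down})

Under these assumptions, r = 0.50 and n = 103 give SE_z = 0.10, r_up of about 0.571, and r_down of about 0.421. The upward distance (0.071) and the downward distance (0.079) differ, and their average is about 0.075.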

Note that the calculation of the standard error of the correlation coefficient in each cell is based by default on the UNWEIGHTED number of cases, even if a weight variable has been used for calculating the correlation coefficient. Ordinarily this procedure will generate a more appropriate statistical test than one based on the weighted N in each cell.

ABBREVIATIONS

Keywords can usually be abbreviated down to the number of characters required to differentiate them from other keywords. Sometimes only one character is required. The keyword for the weight variable, for instance, can be given as "weight=" or "wei=" or even "w=". Either upper or lower case may be used. In the list of keywords above, the minimum string of characters required for each specification is shown in capital letters.
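
As a sketch, the following batch file uses only the minimum abbreviations shown in capital letters in the keyword tables above (‘stu’ for study, ‘w’ for weight, ‘sa’ for savefile). The path and variable names are taken from the examples at the end of this document.

     stu = /sa/nes84
     xvar = spend
     yvar = spend2
     row = education
     w = wtvar
     sa = mytables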

COMMENTS

Anything on a line beginning with "#" is ignored by the batch processor and can therefore be used for comments. Blank lines are also ignored.
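
For example, in a fragment such as the following (variable names taken from the examples at the end of this document), only the three keyword lines are processed. Note that each line of a multi-line comment needs its own "#".

     # Correlation of spend with spend2,
     # broken down by education

     xvar = spend
     yvar = spend2

     row = education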

DECIMAL PLACES

Each statistic has a default number of decimal places with which it will be printed. To change the default, put the desired number of decimals in parentheses after specifying the statistic. The default numbers of decimal places are as follows:

main statistics: 2 (correlations, logs of odds ratios, and their differences)
se: 3
Tstatistic: 2
wncases: 0

It is not necessary to request the ‘correlation’ main statistic unless you want to change the number of decimal places. Unless otherwise specified, the Pearson correlation coefficient is the statistic that will be displayed.

MENTION OF KEYWORD SUFFICIENT

The form ‘keyword=yes’ may be shortened to ‘keyword’. That is, the ‘=yes’ may be omitted for those options which require no further specification. For example, ‘text=yes’ can be shortened to ‘text’.
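
As a sketch, the last two lines of the following fragment are equivalent to ‘text = yes’ and ‘colorcoding = yes’ (variable names taken from the examples at the end of this document):

     xvar = spend
     yvar = spend2
     row = education
     text
     colorcoding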

ORDER OF PROCESSING LISTS

When more than one variable is given for the x, y, row, column, or control variable specifications, the tables are produced in the following order: Tables for EACH of the control variables are produced with the FIRST column variable and the FIRST row variable and the FIRST pair of x and y variables. Then the whole list of control variables is processed again for the SECOND column variable and the FIRST row variable and the FIRST pair of x and y variables; and so on until the whole set of column variables has been processed. Then the whole series is repeated for the SECOND row variable; and so on until all the row variables have been used. Then the whole series is repeated for the SECOND Y-variable; and so on until all the Y-variables have been used. Finally, the whole series is repeated for each succeeding X-variable.

Briefly, the variables will cycle in the following order: control, column, row, Yvar, Xvar. All of the tables will be produced using the same weight, filters, and other options.
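
As an illustration of this cycling, the following fragment requests one X-variable, one Y-variable, two row variables, two column variables, and two control variables. All variable names here are placeholders.

     xvar = x1
     yvar = y1
     row = r1 r2
     column = c1 c2
     control = k1 k2

Following the order described above, the combinations would be processed with the control variable changing fastest and the row variable changing slowest:

     1. k1, c1, r1       5. k1, c1, r2
     2. k2, c1, r1       6. k2, c1, r2
     3. k1, c2, r1       7. k1, c2, r2
     4. k2, c2, r1       8. k2, c2, r2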

REPETITION OF KEYWORDS

If there is not enough room on a line to list all of the desired variables, the keyword can be repeated on a new line, and more variables can be listed. In such a case the second list is appended to the first list, for purposes of generating tables. This appending feature applies to the keywords for specifying the x and y variables, the row, column, control, and filter variables, and the ‘statistics=’ keyword. If other keywords are repeated, the program will print an error message and stop.
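
For example, in the following sketch (variable names taken from the examples at the end of this document) the two ‘row=’ lines have the same effect as the single line ‘row = var1 var2 var3’, and the two ‘statistics=’ lines together request both standard errors and unweighted N’s:

     xvar = spend
     yvar = spend2
     row = var1 var2
     row = var3
     statistics = se
     statistics = ncases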

EXAMPLES OF BATCH FILES

Basic example


     study = /sa/nes84

     xvar = spend
     yvar = spend2
     row = education
     column = gender

     savefile = mytables

Using more options

Specify multiple sets of variables, redefine some ranges,
and use weight and filter variables.


     xvar = spend spend2 spend3
     yvar = age educ
     row = var1(1-9) var2 var3(0-9)
     column = var3, var4

     weight= wtvar
     filters= var21(1-3) var30(1)

     savefile = mytables

Differences and other options

Put differences instead of the original correlations in each cell,
and request some text options.


     xvar = spend
     yvar = spend2

     row = var1 var2
     column = var4 var5

# Display differences (with 3 decimal places) from the overall
#  correlation coefficient

     differences = overall(3)

# Request that full text of the variables be printed,
#  and put a run title or comment on the top of each page

     text= yes
     runtitle= Test run to demonstrate program

     savefile= mytables

CSM, UC Berkeley
April 12, 2011