SAS – NeoResearch

Create a library
Make a data set
Descriptive statistics

Create a library

It is helpful to start a programme by storing the folder path in a macro variable: %let path=<folder path>;

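To use the macro variable: libname <libref> "&path"; (straight double quotes are required, the macro variable is not resolved inside single quotes)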

Libref: maximum 8 characters; letters, underscores or numbers; must start with a letter or underscore; no other special characters.
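As a minimal sketch of the two statements above: the folder path and the libref name study are hypothetical and need to be replaced with a directory that exists on your system.

```sas
/* Hypothetical folder; replace with a real directory */
%let path=C:\projects\study1\data;

/* Double quotes so that &path is resolved to the folder path */
libname study "&path";

/* List the members of the library to confirm the libref was assigned */
proc contents data=study._all_ nods;
run;
```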

Make a data set

Data step

Data <new name>;
Retain <variables>; can be used to set the order of variables in the table (must come before the set statement)
Set <libref.file>; libref not required if the file is in the working library
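Merge <libref.file> (in=m1) <libref.file> … ;
Length <variable> $ w or <variable> w; where w = number of bytes used to store the variable (for a character variable, the number of characters; for a numeric variable, up to 8 bytes, default 8). Decimal places are set with a format (w.d), not with the length statement
<new variable> = <expression>; assign new variables by typing the new variable name and then defining it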
Label;
Format;
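Where <condition>; creates a data subset by keeping only observations that meet the condition, eg a particular level of a variable
Drop <variables>; variables are left out of the new data set but remain available for processing. If the drop= option is used within the set statement, eg set <libref.file> (drop=<variables>), the variables are not read into the PDV and are thus not available for processing
Keep <variables>; limits the variables that are saved to the new data set (keep= can also be used within the set or merge statement)
If m1; use if merging, keeps only observations that come from the data set flagged with in=m1
If <condition> then <action>; else <action>;
By <variable>; required for a match-merge, the input data sets must be sorted by the same variable(s)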
run;
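The statements above can be combined as in the following sketch. It assumes the sashelp.class sample data set shipped with SAS (height in inches, weight in pounds); the data set names demo, size and merged and the variables bmi and age_group are made up for illustration.

```sas
/* Split the sample data into two tables sharing the key variable name */
proc sort data=sashelp.class out=work.demo(keep=name sex age);
  by name;
run;

proc sort data=sashelp.class out=work.size(keep=name height weight);
  by name;
run;

data work.merged;
  retain name age sex;                  /* sets the column order */
  length age_group $ 5;                 /* character variable, 5 bytes */
  merge work.demo(in=m1) work.size(in=m2);
  by name;                              /* both inputs are sorted by name */
  if m1 and m2;                         /* keep only names found in both tables */
  bmi = 703 * weight / height**2;       /* new variable from inches and pounds */
  if age >= 13 then age_group = 'teen';
  else age_group = 'child';
  label bmi = 'Body mass index';
  format bmi 5.1;
  drop weight;                          /* used above, but not saved */
run;
```

Because weight is dropped with a drop statement rather than the drop= data set option, it is still available for the bmi calculation.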

Sorting data

Proc sort data= < >;
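by <variable> descending <variable>; ascending is the default order; the keyword descending goes before each variable that is to be sorted in descending order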
run;
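A short sketch of a mixed sort order, assuming the sashelp.class sample data set; the output name class_sorted is illustrative. Using out= leaves the input data set unchanged.

```sas
proc sort data=sashelp.class out=work.class_sorted;
  by sex descending age;   /* sex ascending (default), then age from oldest to youngest */
run;
```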

List data variables

proc contents data=< > out=list noprint; run;

data list2;
retain varnum;
set list (keep=name type varnum label);
if label=name then delete;
name =upcase(name);
if type=2 then data_type='character'; else if type=1 then data_type='numeric';
drop type;
run;

proc sort data=list2; by data_type name; run; *view this file in excel;

Descriptive statistics

Proc univariate

Proc univariate data= ;
Var …;
Id ;
Histogram / normal ;
run;

Options
Nextrobs=x	Number of extreme values to display
Pctldef=n	Specifies which of the five definitions (n = 1 to 5, default 5) is used to calculate percentiles
Nolabels	Suppress labels (labels used automatically)
Mu0=n	Specifies mean to use for test of mean, if want to test different mean for each variable then list in order
Plot	Displays a stem-and-leaf plot (or horizontal bar chart), box plot and normal probability plot
Normaltest	Requests tests for normality (eg Shapiro-Wilk, Kolmogorov-Smirnov)
Cibasic (type= alpha=)	Requests confidence limits for mean, SD, variance under assumption of normality. Defaults: type=twosided alpha=0.05
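A sketch showing several of these options together, assuming the sashelp.class sample data set. The null value mu0=60 is an arbitrary illustrative choice, not a recommendation.

```sas
proc univariate data=sashelp.class nextrobs=3 mu0=60 normaltest
                cibasic(type=twosided alpha=0.05);
  var height;
  id name;                     /* labels the extreme observations by name */
  histogram height / normal;   /* histogram with a fitted normal curve */
run;
```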

Proc standard

Proc standard data= mean= std=1 out= ; run;
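For example, to create z-scores (mean 0, SD 1), a sketch assuming the sashelp.class sample data set and an illustrative output name class_z:

```sas
proc standard data=sashelp.class mean=0 std=1 out=work.class_z;
  var height weight;   /* only these variables are standardised */
run;
```

Variables not listed in the var statement are copied to the output data set unchanged.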

Proc means

Proc means data= ;
Var …;
Class ;
By ; must be sorted first
Output out= min=minvariable sum=sum max=max mean=etc…. ; outputs the specified descriptive statistics for a single variable by class variables
run;

NB: Nobs = number of observations with each unique combination of class variables; N = number of observations with nonmissing values

Options
Default statistics	n, mean, std, min, max
Specify individual statistics	clm, css, lclm, uclm, max, min, range, mean, mode, n, nmiss, std, sum, var, px (percentile)
Nonobs	Suppresses the nobs column
Nolabels	Suppress labels
Maxdec=n	Limit number of decimal places
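A sketch combining statistic keywords, options and an output statement, assuming the sashelp.class sample data set; the output data set and the names after each = sign are illustrative.

```sas
proc means data=sashelp.class n nmiss mean std min max maxdec=1 nonobs;
  var height;
  class sex;
  output out=work.height_stats
         n=n_height mean=mean_height min=min_height max=max_height;
run;
```

The output data set also contains the automatic variables _TYPE_ and _FREQ_, and includes an overall row (_TYPE_=0) in addition to one row per level of the class variable.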

Proc freq

Proc freq data=< . > ;
Table <row variable>*<column variable> / <options>; the row is the grouping variable by which columns are compared
Where ;
By ;
Format ;
Test ;
run;

NB: for dates, the frequency count can be restricted to fewer discrete values (eg just years) by using a format statement, eg format <date variable> year4.; User defined formats from proc format can also be applied.
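A sketch of grouping dates by year, assuming the sashelp.stocks sample data set, whose date variable holds SAS date values:

```sas
proc freq data=sashelp.stocks;
  tables date / nocum;
  format date year4.;   /* counts are grouped by the formatted value, ie the year */
run;
```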

Options
Nlevels	Displays no of distinct values for each variable
Order=freq	Displays results in descending order of frequency
Order=formatted	Displays in ascending order (can be used to reorder and used with format statement if needed)

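To test for duplicate ID values use nlevels in the proc statement and noprint in the table statement. If the number of levels of the ID variable is smaller than the number of observations, there are duplicates.
Proc freq data= nlevels;
Table _all_ / noprint;
run;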

Table options
chisq
Relrisk	Gives OR & RR with 95%CI for 2x2 tables – compares the column variable across levels of the row (grouping) variable
Nocum	No cumulative statistics
Nopercent	Suppresses the percentage display
Norow	Suppresses the row percentage
Nocol	Suppresses the column percentage
Nofreq	Suppresses the frequency display
Noprint	Suppresses the freq table output
Format=w.	Increase width of cells, used if labels are wrapped
Out=<data set>	Creates a data set with frequency and percentages
Output out=<data set> <options>;	Separate statement (not a table option) that creates a data set with specified statistics
Missing	Includes missing data in frequency table, otherwise ignored
Agree	Kappa & McNemar's test
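A sketch of a 2x2 comparison using some of the table options above. It assumes the sashelp.heart sample data set, in which sex and status each have two levels.

```sas
proc freq data=sashelp.heart;
  /* sex is the row (grouping) variable, status is the column variable */
  tables sex*status / chisq relrisk nopercent nocol;
run;
```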

Test options
Test option	Asymptotic tests	Required tables statement option
Agree	Simple & weighted kappa	agree
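A sketch of an agreement analysis. The counts below are made-up illustrative data, not results from any study; the data set and variable names are hypothetical.

```sas
/* Hypothetical cell counts for two raters classifying the same subjects */
data work.ratings;
  input rater1 $ rater2 $ count;
  datalines;
yes yes 20
yes no 5
no yes 3
no no 12
;
run;

proc freq data=work.ratings;
  weight count;                                 /* each row is a cell count */
  tables rater1*rater2 / agree norow nocol;     /* kappa and McNemar's test */
  test kappa;                                   /* asymptotic test of kappa = 0 */
run;
```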

Proc tabulate

Proc tabulate data=< . > ;
Class…; N non-missing is the default statistic
Var…; sum is the default statistic
Table <page expression>, <row expression>, <column expression>*(<statistics>); all variables that are part of a dimension expression must be specified in the class or var statement
run;

Categorical and continuous variables can be crossed in the table statement eg var1*var2 in the column expression gives sum of continuous var2 by levels of var1.

Multiple tables can be created in a single step

Can incorporate a where statement

Out=<new data> can be added to the proc tabulate statement to save the table as a data set

Other descriptive statistics: pctn rowpctn colpctn median p1 p5 p10 qrange
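A sketch of a two-dimensional table that crosses a class variable with a continuous variable, assuming the sashelp.class sample data set; the output name tab_out is illustrative.

```sas
proc tabulate data=sashelp.class out=work.tab_out;
  class sex age;
  var height;
  /* rows: age plus a total row; columns: counts and column percentages by sex,
     then mean and median height by sex */
  table age all,
        sex*(n colpctn) sex*height*(mean median);
run;
```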

Identifying missing data

Continuous data
proc mi data=< >;
ods select misspattern;
run;
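A sketch of the same step on real variables, assuming the sashelp.heart sample data set. The nimpute=0 option is an addition to the template above; it asks proc mi to report the missing data patterns without creating any imputations.

```sas
proc mi data=sashelp.heart nimpute=0;
  var height weight cholesterol;
  ods select misspattern;   /* show only the missing data patterns table */
run;
```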

Categorical data
proc format;
value nm . = '.' other = 'X';
value $ch ' ' = '.' other = 'X';
run;
proc freq data=< >;
Table… / list missing nocum;
format _numeric_ nm. _character_ $ch.;
run;