SAS

Create a library

It is helpful to start a programme with a path macro variable:  %let path=<folder path>;

To use the macro variable: libname <libref> "&path"; (double quotes are required, &path does not resolve inside single quotes)

Libref: maximum 8 characters; letters, numbers or underscores only; must start with a letter or underscore; no other special characters.
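As a minimal sketch of the two steps above: the folder path and the libref mylib are hypothetical placeholders, and the folder must already exist for the libname statement to succeed.

```sas
/* Hypothetical folder path: replace with a folder that exists on your system */
%let path=C:\projects\study\data;

/* Double quotes so that &path resolves to the folder path */
libname mylib "&path";

/* Write a small data set to the new library to confirm that it works */
data mylib.class_copy;
    set sashelp.class;
run;
```

If the libref is assigned correctly the log reports that mylib was successfully assigned and class_copy appears in that folder.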

Make a data set

Data step

Data <new name>;
Retain <variables>; can be used to set the order of variables in the table (must come before the set or merge statement)
Set <libref.file>; libref is not required if the file is in the Work library
Merge <libref.file> (in=m1) <libref.file> …;
Length <variable> $ w; or Length <variable> w; where w = number of bytes of storage (one byte per character for character variables; up to 8 bytes for numeric variables). Decimal places are set with a format (w.d), not with length
Assign new variables; type the new variable name and then define it
Label;
Format;
Where; creates a data subset by limiting the table to particular levels of a variable
Drop <variables>; if drop is used as a data set option within the set statement (drop= ) then the variables are not read into the PDV and are thus not available for processing
Keep <variables>; limits the variables that are saved to the new dataset (can also be used as a data set option within the set or merge statement)
If <m1>; use if merging, keeps only the observations that come from the data set flagged with in=m1
If <condition> then <action>; else <action>;
By <variable>; applies to the set or merge statement that precedes it, the input data must be sorted by this variable
run;
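The template above could be filled in along the following lines. This is only a sketch: the data sets demog and labs, their variables and the cut-off of 6 are invented for illustration.

```sas
/* Hypothetical example data */
data demog;
    input id sex $ age;
    datalines;
1 F 34
2 M 51
3 F 47
;
run;

data labs;
    input id chol;
    datalines;
1 5.2
3 6.8
4 4.9
;
run;

/* Both inputs must be sorted by the merge key */
proc sort data=demog; by id; run;
proc sort data=labs;  by id; run;

data combined;
    retain id age sex chol;              /* sets the column order, placed before merge */
    length chol_group $ 6;               /* character variable stored in 6 bytes */
    merge demog (in=m1) labs (in=m2);
    by id;
    if m1 and m2;                        /* keep only ids found in both data sets */
    if chol >= 6 then chol_group = 'high';
    else chol_group = 'normal';
    label chol = 'Total cholesterol';
    format chol 4.1;
run;
```

Only ids 1 and 3 are in both inputs, so combined has two observations; using if m1; instead would keep every id from demog.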

Sorting data

Proc sort data= < >;
by <variable> descending <variable>; ascending order is the default, the descending keyword goes before each variable that it applies to
run;
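A short sketch of a mixed-order sort, assuming the sashelp.class sample data set is available; the output name class_sorted is a placeholder.

```sas
/* out= writes the sorted copy to a new data set and leaves the input unchanged */
proc sort data=sashelp.class out=class_sorted;
    by sex descending age;   /* sex ascending (default), age descending within sex */
run;
```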

List data variables

proc contents data=< > out=list noprint; run;

data list2;
retain varnum;
set list (keep=name type varnum label);
if label=name then delete;
name =upcase(name);
if type=2 then data_type='character'; else if type=1 then data_type='numeric';
drop type;
run;

proc sort data=list2; by data_type name; run; *view this file in excel;

Descriptive statistics

Proc univariate

Proc univariate data= ;
Var …;
Id ;
Histogram / normal ;
run;
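For illustration, the template might be filled in as below. The sashelp.class sample data set is assumed to be available, and the test mean of 60 is an arbitrary value chosen for the example.

```sas
proc univariate data=sashelp.class
                nextrobs=3                            /* show 3 lowest and 3 highest values */
                normal                                /* tests for normality */
                mu0=60                                /* test the mean against 60 */
                cibasic(type=twosided alpha=0.05);    /* 95% confidence limits */
    var height;
    id name;                      /* label the extreme observations by name */
    histogram height / normal;    /* histogram with a fitted normal curve */
run;
```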

Options
Nextrobs=x              Number of extreme values to display
Pctldef=n               Specifies which definition (1 to 5) is used to calculate percentiles
Nolabels                Suppress labels (labels used automatically)
Mu0=n                   Specifies mean to use for test of mean, if want to test different mean for each variable then list in order
Plot                    Displays frequency histogram and normal quantile plot
Normaltest              Requests tests of the null hypothesis that the sample comes from a normal distribution
Cibasic (type= alpha=)  Requests confidence limits for mean, SD, variance under assumption of normality. Defaults: type=twosided alpha=0.05

Proc standard

Proc standard data= mean= std=1 out= ; run;
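As a sketch, standardising two variables to mean 0 and SD 1 could look like this. The sashelp.class sample data set is assumed, and class_std is a placeholder name.

```sas
/* Rescale height and weight to mean 0 and standard deviation 1 */
proc standard data=sashelp.class mean=0 std=1 out=class_std;
    var height weight;
run;

/* Check the result: the means should be 0 and the SDs 1 */
proc means data=class_std mean std maxdec=2;
    var height weight;
run;
```

Without a var statement all numeric variables in the data set are standardised.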

Proc means

Proc means data= ;
Var …;
Class ;
By ; must be sorted first
Output out= min=minvariable sum=sumvariable max=maxvariable mean=meanvariable …; saves the specified descriptive statistics for the analysis variable to a data set, one row per combination of class variables
run;

NB: Nobs = number of observations with each unique combination of class variables; N = number of observations with nonmissing values
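A possible worked version of the template, assuming the sashelp.class sample data set; the output data set and variable names are placeholders.

```sas
proc means data=sashelp.class n nmiss mean std min max maxdec=1;
    var height;
    class sex;
    /* One named output variable per requested statistic */
    output out=height_stats
           min=min_height max=max_height mean=mean_height sum=sum_height;
run;

proc print data=height_stats;
run;
```

The output data set also contains the automatic variables _TYPE_ and _FREQ_; the row with _TYPE_=0 holds the overall statistics and the rows with _TYPE_=1 hold the statistics for each level of sex.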

Options
Default statistics              n, mean, std, min, max
Specify individual statistics   clm, css, lclm, uclm, max, min, range, mean, mode, n, nmiss, std, sum, var, px (percentile, eg p50)
Nonobs                          Suppresses the nobs column
Nolabels                        Suppress labels
Maxdec=n                        Limit number of decimal places

Proc freq

Proc freq data=< . > ;
Table <row variable>*<column variable> / <options>; the row variable is the grouping variable by which the columns are compared
Where ;
By ;
Format ;
Test ;
run;

NB: for dates, the frequency count can be restricted to fewer discrete values (eg just years) by using a format statement, eg year4. (just years). User defined formats from proc format can also be applied.
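A small sketch of grouping dates by year with a format statement; the visits data set and its values are made up for the example.

```sas
/* Hypothetical example data with a SAS date variable */
data visits;
    input id visit_date :date9.;
    format visit_date date9.;
    datalines;
1 15JAN2019
2 03MAR2019
3 21JUL2020
4 09NOV2020
5 30DEC2020
;
run;

/* The year4. format makes proc freq count by year rather than by day */
proc freq data=visits;
    tables visit_date / nocum;
    format visit_date year4.;
run;
```

The format only changes how the values are grouped for this procedure; the stored dates in visits are unchanged.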

Options
Nlevels           Displays number of distinct values for each variable
Order=freq        Displays results in descending order of frequency
Order=formatted   Displays in ascending order of the formatted values (can be used to reorder and used with format statement if needed)

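To test for duplicate ID values use nlevels in the proc statement and noprint in the table statement; if the number of levels of the ID variable is smaller than the number of observations then there are duplicates
Proc freq data= nlevels;
Table _all_ / noprint;
run;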
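The duplicate check could be sketched as follows. The ids data set is invented, and the second step (listing which values are duplicated) goes slightly beyond the note above, using the out= table option with a where= data set option.

```sas
/* Hypothetical example data: id 102 appears twice */
data ids;
    input id;
    datalines;
101
102
102
103
;
run;

/* The nlevels table reports 3 levels for 4 observations, so there are duplicates */
proc freq data=ids nlevels;
    tables id / noprint;
run;

/* Save the counts and keep only ids that occur more than once */
proc freq data=ids noprint;
    tables id / out=id_counts (where=(count > 1));
run;

proc print data=id_counts;
run;
```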

Table options
Chisq             Chi-square tests of association
Relrisk           Gives OR & RR with 95% CI for 2x2 tables; compares column variables by the row (grouping) variable
Nocum             No cumulative statistics
Nopercent         Suppresses the percentage display
Norow             Suppresses the row percentage
Nocol             Suppresses the column percentage
Nofreq            Suppresses the frequency display
Noprint           Suppresses the freq table output
Format=w.         Increases width of cells, used if labels are wrapped
Out=<data set>    Creates a data set with frequency and percentages
Missing           Includes missing data in frequency table, otherwise ignored
Agree             Kappa & McNemar's

Output out=<data set> <options>; is a separate statement (not a table option) that creates a data set with specified statistics

Test options
Test option   Asymptotic tests          Required tables statement option
Agree         Simple & weighted kappa   agree

Proc tabulate

Proc tabulate data=< . > ;
Class…; N non-missing is the default statistic
Var…; sum is the default statistic
Table <page>, <row>, <column>*(<statistics>); dimensions are separated by commas; all variables that are part of a dimension expression must be specified in the class or var statement
run;

Categorical and continuous variables can be crossed in the table statement eg var1*var2 in the column expression gives sum of continuous var2 by levels of var1.

Multiple tables can be created in a single step

Can incorporate a where statement

Out=<new data> can be inserted in the proc tabulate options

Other descriptive statistics: pctn rowpctn colpctn median p1 p5 p10 qrange
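As an illustrative sketch of the points above, two tables are produced in one step, with continuous variables crossed with statistics. The sashelp.class sample data set is assumed, and the age cut-off in the where statement is arbitrary.

```sas
proc tabulate data=sashelp.class;
    where age >= 12;                  /* where statement limits the rows used */
    class sex;                        /* categorical variable */
    var height weight;                /* continuous variables */

    /* Rows: sex plus a total row; columns: statistics for each continuous variable */
    table sex all, (height weight)*(n mean median);

    /* Second table in the same step: counts and percentages by sex */
    table sex, n pctn;
run;
```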

Identifying missing data

Continuous data
proc mi data=< >;
ods select misspattern;
run;

Categorical data
proc format;
value nm . = '.' other = 'X';
value $ch ' ' = '.' other = 'X';
run;
proc freq data=< >;
Table… / list missing nocum;
format _numeric_ nm. _character_ $ch.;
run;
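Both approaches can be tried on a small test data set. In this sketch the survey data set is invented, nimpute=0 is added so that proc mi only reports the pattern without imputing anything, and proc mi requires SAS/STAT.

```sas
/* Hypothetical example data: a period is read as missing for both types */
data survey;
    input id age sex $ income;
    datalines;
1 34 F 52000
2 . M 48000
3 47 . .
4 51 F .
;
run;

/* Continuous data: missing data patterns only, no imputation */
proc mi data=survey nimpute=0;
    var age income;
    ods select misspattern;
run;

/* Show '.' for missing and 'X' for any non-missing value */
proc format;
    value nm   . = '.' other = 'X';
    value $ch ' ' = '.' other = 'X';
run;

/* One row per observed missingness pattern across the listed variables */
proc freq data=survey;
    tables age*sex*income / list missing nocum;
    format _numeric_ nm. _character_ $ch.;
run;
```

Because every non-missing value is displayed as X, the proc freq listing collapses to one line per combination of missing and non-missing variables.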