Create a library
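It is helpful to start a programme with a path macro variable: %let path=<folder path>;
To reference the macro variable: libname <libref> "&path"; (use straight double quotes, a macro variable does not resolve inside single quotes)
Libref: maximum 8 characters; letters, numbers or underscores; must start with a letter or underscore; no special characters.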
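A minimal sketch of the two statements above. So that it runs without a real folder, the path is taken from the Work library with %sysfunc(pathname(work)); in practice this would be your own folder. The libref mylib and the data set name class_copy are made up for illustration; sashelp.class is a sample table shipped with SAS.

```sas
/* Store the folder path once; here the Work folder stands in for a real project folder */
%let path=%sysfunc(pathname(work));

/* Assign the libref; double quotes are needed so that &path resolves */
libname mylib "&path";

/* Write a copy of a sample table into the new library */
data mylib.class_copy;
  set sashelp.class;
run;
```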
Make a data set
Data step
Data <new name>;
Retain <variables>; can be used to set the order of variables in the table (must come before the set statement)
Set <libref.file>; libref not required if the file is in the Work library
Merge <libref.file> (in=m1) <libref.file>…. ;
Length <variable> $ w; or Length <variable> w; where w = number of bytes stored: for a character variable this is the number of characters, for a numeric variable the storage bytes (default 8). Decimal places are set with a format (w.d), not with length
Assign new variables: type the new variable name and then define it, <new variable> = <expression>;
Label;
Format;
Where <condition>; creates a subset by keeping only the observations that meet the condition, e.g. a particular level of a variable
Drop <variables>; if drop is used as a data set option on the set statement, set <libref.file> (drop=<variables>), the variables are not read into the PDV and are thus not available for processing
Keep <variables>; limits the variables that are saved to the new data set (can also be used as a data set option, keep=, on the set or merge line)
If m1; use when merging, keeps only observations found in the data set flagged with in=m1
If <condition> then <action>; else <action>;
By <variable>;
run;
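The data step statements above could be combined as in the sketch below. The two small tables (demo, labs), their values and the variable names are invented for illustration only.

```sas
/* Two small invented tables that share the key variable id */
data demo;
  input id age;
  datalines;
1 34
2 51
3 47
;
run;

data labs;
  input id chol;
  datalines;
1 5.2
3 6.1
4 4.8
;
run;

/* Both tables must be sorted by the key before a merge with a by statement */
proc sort data=demo; by id; run;
proc sort data=labs; by id; run;

data combined;
  length age_group $ 8;                  /* character variable, 8 bytes */
  merge demo (in=m1) labs (keep=id chol);
  by id;
  if m1;                                 /* keep only ids that appear in demo */
  if age >= 50 then age_group = 'older';
  else age_group = 'younger';
  label chol = 'Cholesterol';
  format chol 4.1;
run;
```

Here id 4 is dropped because it is not in demo, and id 2 is kept with a missing chol value.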
Sorting data
Proc sort data= < >;
by <variable> descending <variable>; ascending is the default and needs no keyword; descending goes before the variable it applies to
run;
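A short sketch of a mixed sort order, using the sashelp.class sample table. The output name class_sorted is made up; out= is used so the source table is left unsorted.

```sas
/* Sort by sex (ascending by default), then by age from oldest to youngest */
proc sort data=sashelp.class out=class_sorted;
  by sex descending age;
run;
```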
List data variables
proc contents data=< > out=list noprint; run;
data list2;
retain varnum;
set list (keep=name type varnum label);
if label=name then delete;
name =upcase(name);
if type=2 then data_type='character'; else if type=1 then data_type='numeric';
drop type;
run;
proc sort data=list2; by data_type name; run; *view this file in excel;
Descriptive statistics
Proc univariate
Proc univariate data= ;
Var …;
Id ;
Histogram / normal ;
run;
| Options | |
|---|---|
| Nextrobs=x | Number of extreme values to display |
| Pctldef=n | Specifies which of the five definitions (1 to 5) is used to calculate percentiles |
| Nolabels | Suppress labels (labels used automatically) |
| Mu0=n | Specifies mean to use for test of mean, if want to test different mean for each variable then list in order |
| Plot | Displays a stem-and-leaf plot (or horizontal bar chart), box plot and normal probability plot |
| Normal (alias Normaltest) | Requests tests for normality (e.g. Shapiro-Wilk, Kolmogorov-Smirnov) |
| Cibasic (type= alpha=) | Requests confidence limits for mean, SD, variance under assumption of normality. Defaults: type=twosided alpha=0.05 |
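As an illustration of how these options fit together, here is a sketch on the sashelp.class sample table. The test mean of 60 and the choice of variables are arbitrary.

```sas
proc univariate data=sashelp.class
                normal                              /* tests for normality */
                cibasic(type=twosided alpha=0.05)   /* confidence limits for mean, SD, variance */
                mu0=60                              /* null value for the test of the mean */
                nextrobs=3;                         /* show 3 lowest and 3 highest values */
  var height;
  id name;                                          /* label extreme values with name */
  histogram height / normal;                        /* histogram with fitted normal curve */
run;
```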
Proc standard
Proc standard data= mean= std=1 out= ; run;
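For example, a sketch that standardises two variables to mean 0 and SD 1 (z-scores) and then checks the result. The table is the sashelp.class sample data; the output name class_z and the choice of mean=0 are assumptions for illustration.

```sas
/* Rescale height and weight to mean 0, SD 1 */
proc standard data=sashelp.class mean=0 std=1 out=class_z;
  var height weight;
run;

/* Check: the means should be 0 and the SDs 1 */
proc means data=class_z mean std maxdec=2;
  var height weight;
run;
```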
Proc means
Proc means data= ;
Var …;
Class ;
By ; must be sorted first
Output out=<data set> min=<name> sum=<name> max=<name> mean=<name> …; writes the specified descriptive statistics to a data set, one row per combination of class variables; give one output name per analysis variable in the var statement
run;
NB: Nobs = number of observations with each unique combination of class variables; N = number of observations with nonmissing values
| Options | |
|---|---|
| Default statistics | n,mean, std, min, max |
| Specify individual statistics | clm, css, lclm, uclm, max, min, range, mean, mode, n, nmiss, std, sum, var, p<n> (percentile, e.g. p5, p50, p95) |
| Nonobs | Suppresses the nobs column |
| Nolabels | Suppress labels |
| Maxdec=n | Limit number of decimal places |
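A sketch that combines the statements and options above on the sashelp.class sample table. The output data set name class_stats and the output variable names are made up.

```sas
proc means data=sashelp.class n mean std min max maxdec=1 nonobs;
  class sex;
  var height weight;
  /* One output name per analysis variable, in the same order as the var statement */
  output out=class_stats
         mean=mean_height mean_weight
         max=max_height max_weight;
run;

/* Inspect the saved statistics; _TYPE_ = 0 is the overall row */
proc print data=class_stats;
run;
```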
Proc freq
Proc freq data=< . > ;
Table <row variable>*<column variable> / <options>; the row is the grouping variable by which columns are compared
Where ;
By ;
Format ;
Tests ;
run;
NB: for dates, the frequency count can be restricted to fewer discrete values (eg just years) by using a format statement, eg year4. (just years). User defined formats from proc format can also be applied.
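A small sketch of counting by year, using the sashelp.stocks sample table, which has a SAS date variable called date.

```sas
/* Group daily or monthly dates into calendar years for the count */
proc freq data=sashelp.stocks;
  tables date / nocum;
  format date year4.;
run;
```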
| Options | |
|---|---|
| Nlevels | Displays no of distinct values for each variable |
| Order=freq | Displays results in descending order of frequency |
| Order=formatted | Displays in ascending order of the formatted values (can be combined with a format statement to reorder levels if needed) |
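To test for duplicate ID values use nlevels in the proc statement and noprint in the table statement; if the number of levels of the ID variable is lower than the number of observations, there are duplicates
Proc freq data= nlevels;
Table _all_ / noprint;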
run;
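A sketch of the duplicate check on the sashelp.class sample table, treating name as if it were the ID variable (an assumption for illustration).

```sas
/* nlevels reports the number of distinct values; noprint hides the frequency table */
proc freq data=sashelp.class nlevels;
  tables name / noprint;
run;

/* Compare the level count with the number of observations reported here */
proc sql;
  select count(*) as n_obs from sashelp.class;
quit;
```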
| Table options | |
|---|---|
| Chisq | Chi-square tests of association |
| Relrisk | Gives OR & RR with 95%CI – compares column variables by the X or grouping variable |
| Nocum | No cumulative statistics |
| Nopercent | Suppresses the percentage display |
| Norow | Suppresses the row percentage |
| Nocol | Suppresses the column percentage |
| Nofreq | Suppresses the frequency display |
| Noprint | Suppresses the freq table output |
| Format=w. | Increase width of cells, used if labels are wrapped |
| Out=<data set> | Creates a data set with frequency and percentages |
| Output out=<data set> <options>; | Separate statement (not a table option): creates a data set with the specified statistics |
| Missing | Includes missing data in frequency table, otherwise ignored |
| Agree | Kappa & McNemar's test |
| Test option | Asymptotic tests | Required tables statement option |
|---|---|---|
| Agree | Simple & weighted kappa | agree |
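The table options above could be used on a 2x2 table as in this sketch. The counts are invented, and the data set and variable names (exposure, exposed, disease, count) are hypothetical.

```sas
/* Invented 2x2 counts, one row per cell */
data exposure;
  input exposed $ disease $ count;
  datalines;
Yes Yes 20
Yes No 80
No Yes 10
No No 90
;
run;

/* order=data keeps Yes before No so that the exposed group is row 1 */
proc freq data=exposure order=data;
  weight count;                       /* each row stands for count observations */
  tables exposed*disease / chisq relrisk nocol nopercent;
run;
```

The row variable (exposed) is the grouping variable, so the relative risk compares the disease columns between exposed and unexposed rows.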
Proc tabulate
Proc tabulate data=< . > ;
Class…; N non-missing is the default statistic
Var…; sum is the default statistic
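Table <page expression>, <row expression>, <column expression>; statistics are defined within the expressions, e.g. <variable>*(n mean); all variables that are part of a dimension expression must be specified in the class or var statement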
run;
Categorical and continuous variables can be crossed in the table statement eg var1*var2 in the column expression gives sum of continuous var2 by levels of var1.
Multiple tables can be created in a single step
Can incorporate a where statement
Out=<new data> can be inserted in the proc tabulate options
Other descriptive statistics: pctn rowpctn colpctn median p1 p5 p10 qrange
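A sketch of a two-dimensional table on the sashelp.class sample data, crossing a class variable with a continuous variable in the column expression.

```sas
proc tabulate data=sashelp.class;
  class sex age;
  var height;
  /* Rows: each age plus a total row; columns: n, mean and median of height within sex */
  table age all, sex*height*(n mean median);
run;
```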
Identifying missing data
Continuous data
proc mi data=< > nimpute=0; *nimpute=0 shows the missing data patterns without imputing;
ods select misspattern;
run;
Categorical data
proc format;
value nm . = '.' other = 'X';
value $ch ' ' = '.' other = 'X';
run;
proc freq data=< >;
Table… / list missing nocum;
format _numeric_ nm. _character_ $ch.;
run;
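A self-contained sketch of the missing-pattern check above. The survey table and its values are invented; the formats collapse every value to '.' (missing) or 'X' (present), so each row of the list output is one missing-data pattern.

```sas
/* Invented data; a single period is read as missing for both variable types */
data survey;
  input id sex $ age;
  datalines;
1 F 34
2 . 51
3 M .
4 . .
;
run;

proc format;
  value nm . = '.' other = 'X';
  value $ch ' ' = '.' other = 'X';
run;

proc freq data=survey;
  tables sex*age / list missing nocum;
  /* _numeric_ and _character_ apply the formats to all variables of each type */
  format _numeric_ nm. _character_ $ch.;
run;
```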