In ordinary linear regression, our primary measure of model fit was R2, which indicates the proportion of variance in the dependent variable explained by the model. It would be useful to have a similar measure for logistic regression. However, R2 is only appropriate for linear regression, with its continuous dependent variable. To get around this problem, a number of statisticians have developed so-called ‘Pseudo R2’ measures that aim to mimic R2 for logistic regression models. Unlike the actual R2, these are approximations, so there are several different Pseudo R-squares, each taking a different conceptual approach to what R2 means.
Putting aside my own bias against such methods (using such pseudo measures for non-linear models can be very misleading) – here are my 2 cents on the subject:
The reason you might find it hard to determine an “industry-standard” R-squared value is that one does not exist. In fact, statisticians are still debating which variant of R-squared is the “correct” one for assessing explained variance (and others argue it should not be used at all). The value of R-squared depends on the marginal proportion of cases with events: for the uncorrected Cox & Snell measure this means the maximum value that can be attained changes in a nonlinear manner, and the corrected version (Nagelkerke) simply divides by that maximum value (the upper bound).
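To see how that upper bound behaves, here is a minimal sketch, assuming the standard Cox & Snell definition R2 = 1 - (L0/L1)^(2/n) with the full-model likelihood L1 capped at 1: the best attainable value depends only on the marginal event proportion, which is exactly what Nagelkerke's rescaling corrects for.

```python
import numpy as np

def cox_snell_upper_bound(p_event):
    """Maximum attainable Cox & Snell R-squared for a binary outcome with
    marginal event proportion p_event (reached when the full model predicts
    every case perfectly, i.e. its likelihood equals 1)."""
    # per-observation null likelihood: p^p * (1-p)^(1-p)
    null_geo_mean = p_event**p_event * (1 - p_event)**(1 - p_event)
    return 1 - null_geo_mean**2

for p in (0.5, 0.2, 0.05):
    print(p, round(cox_snell_upper_bound(p), 3))
# roughly: 0.5 -> 0.75, 0.2 -> 0.63, 0.05 -> 0.33 (the bound shrinks as events get rarer)
```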
Cox & Snell
The ratio of the likelihoods (intercept-only model over full model) reflects the improvement of the full model over the intercept-only model: the smaller the ratio, the greater the improvement and the higher the R-squared. So an R-squared of 0.005 is very small, i.e. only about 0.5% of the variance is explained by the variables.
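As a rough sketch of how both indices fall out of the fitted and intercept-only log-likelihoods (using statsmodels here; the simulated data is only a stand-in so the snippet runs):

```python
import numpy as np
import statsmodels.api as sm

# toy data just so the snippet runs; substitute your own y (0/1) and X
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 0.3))))

res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

n = res.nobs
llf, llnull = res.llf, res.llnull                   # log-likelihoods: full and intercept-only

cox_snell  = 1 - np.exp((2 / n) * (llnull - llf))   # 1 - (L0/L1)^(2/n)
max_cs     = 1 - np.exp((2 / n) * llnull)           # upper bound, reached when L1 = 1
nagelkerke = cox_snell / max_cs                     # Cox & Snell rescaled by its maximum
print(cox_snell, nagelkerke)
```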
Both indices are measures of strength of association (i.e. whether any predictor is associated with the outcome, as in an LR test), and can be used to quantify predictive ability or model performance. A single predictor may have a significant effect on the outcome and yet not be very useful for predicting individual responses, hence the need to assess the performance of the model as a whole (with respect to the null model).
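That gap between “significant” and “predictive” is easy to demonstrate. A hedged sketch with simulated data (a deliberately weak but real effect and a large sample): the LR test comes out highly significant, while the pseudo R-squared stays tiny.

```python
import numpy as np
import statsmodels.api as sm

# a weak but real effect: with a large sample it is clearly "significant",
# yet it buys almost no ability to predict individual responses
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.15 * x - 0.5))))

res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
cox_snell = 1 - np.exp((2 / n) * (res.llnull - res.llf))

print("LR test p-value:", res.llr_pvalue)   # typically far below 0.001
print("Cox & Snell R2: ", cox_snell)        # typically below 0.01
```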
So, in a nutshell – this is half voodoo and half science. Whether the value is truly meaningful depends on sample size, number of coefficients, variables, problem domain, transformations of the data, etc. As a rule of thumb, something like R-squared = 0.50 is very good. However, while you would want as high a value as possible, note that anything above 0.85 usually indicates high correlation amongst your variables (i.e. they are too alike).
Focus on what matters in the analysis (the classification error rates). I can safely recommend describing the results of your analysis without reference to R2, which is more likely to mislead than not.
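If you do go the error-rate route, here is a minimal in-sample sketch (again statsmodels with toy data; a cross-validated or out-of-sample error rate would be the more honest figure):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 0.3))))

res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
pred = (res.predict() >= 0.5).astype(int)   # 0.5 cutoff; adjust for your costs/base rate
print("classification error rate:", np.mean(pred != y))
```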
More details can be found in Harrell’s book, Regression Modeling Strategies (pp. 203-205, 230-244, 247-249).