How do I add a sum-to-zero constraint to a GLM in Python?

Problem description:

I have a model set up in Python using the statsmodels glm function, but now I want to add a sum-to-zero constraint to the model.

The model is defined as follows:

import statsmodels.api as sm  # needed for sm.families below
import statsmodels.formula.api as smf
model = smf.glm(formula="A ~ B + C + D", data=data, family=sm.families.Poisson()).fit()

In R, to add the constraint, I would simply do something like this:

model <- glm(A ~ B + C + D - 1, family=poisson(), data=data, contrasts=list(C="contr.sum", D="contr.sum"))

That adds the sum-to-zero constraint to both C and D, but I am not sure how to achieve the same in Python.

I have seen that there is a fit_constrained() method available, but I am not sure how to use it, or whether it is even the right tool for what I need.

http://statsmodels.sourceforge.net/devel/generated/statsmodels.genmod.generalized_linear_model.GLM.fit_constrained.html#statsmodels.genmod.generalized_linear_model.GLM.fit_constrained

Can anyone offer advice on applying this constraint?

Here is an example to illustrate fit_constrained, using the Gaussian family since I didn't quickly find a Poisson example with categorical variables:

import pandas
import statsmodels.api as sm
from statsmodels.formula.api import glm

url = 'http://www.ats.ucla.edu/stat/data/hsb2.csv'
hsb2 = pandas.read_table(url, delimiter=",")

mod = glm("write ~ C(race) - 1", data=hsb2)
res = mod.fit()
print(res.summary())

Constrain all coefficients to sum to zero:

res_c = mod.fit_constrained('C(race)[1] + C(race)[2] + C(race)[3] + C(race)[4] = 0')
print(res_c.summary())

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  write   No. Observations:                  200
Model:                            GLM   Df Residuals:                      197
Model Family:                Gaussian   Df Model:                            2
Link Function:               identity   Scale:                   1232.08314649
Method:                          IRLS   Log-Likelihood:                -993.41
Date:                Wed, 25 Mar 2015   Deviance:                   2.4149e+05
Time:                        16:42:37   Pearson chi2:                 2.41e+05
No. Iterations:                     1                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
C(race)[1]     1.0002    221.565      0.005      0.996      -433.260   435.260
C(race)[2]   -41.1814    267.253     -0.154      0.878      -564.988   482.626
C(race)[3]    -6.3498    235.771     -0.027      0.979      -468.453   455.754
C(race)[4]    46.5311    100.184      0.464      0.642      -149.827   242.889
==============================================================================

Model has been estimated subject to linear equality constraints.

Constraints are comma-separated and default to equal zero:

res_c2 = mod.fit_constrained('C(race)[1] + C(race)[2], C(race)[3] + C(race)[4]')
print(res_c2.summary())

The last print produces:

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  write   No. Observations:                  200
Model:                            GLM   Df Residuals:                      198
Model Family:                Gaussian   Df Model:                            1
Link Function:               identity   Scale:                   1438.99574167
Method:                          IRLS   Log-Likelihood:                -1008.9
Date:                Wed, 25 Mar 2015   Deviance:                   2.8204e+05
Time:                        16:42:37   Pearson chi2:                 2.82e+05
No. Iterations:                     1                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
C(race)[1]    13.6286    242.003      0.056      0.955      -460.689   487.946
C(race)[2]   -13.6286    242.003     -0.056      0.955      -487.946   460.689
C(race)[3]   -41.6606    111.458     -0.374      0.709      -260.115   176.794
C(race)[4]    41.6606    111.458      0.374      0.709      -176.794   260.115
==============================================================================

Model has been estimated subject to linear equality constraints.
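
As far as I can tell from the documentation, fit_constrained also accepts the constraints as a tuple (R, q) of a constraint matrix and a vector of constraint values, meaning R @ params = q. The sketch below uses synthetic Gaussian data with hypothetical column names `g` and `y` and checks that the string form and the matrix form give the same estimates.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): a four-level factor and a Gaussian response.
rng = np.random.default_rng(1)
n = 200
g = rng.choice(["a", "b", "c", "d"], size=n)
df = pd.DataFrame({"g": g, "y": rng.normal(loc=50.0, scale=10.0, size=n)})

mod_g = smf.glm("y ~ C(g) - 1", data=df)  # Gaussian family by default

# String form: two comma-separated constraints, each implicitly "= 0".
res_str = mod_g.fit_constrained("C(g)[a] + C(g)[b], C(g)[c] + C(g)[d]")

# Matrix form: one row of R per constraint, columns in parameter order
# (levels a, b, c, d); q holds the right-hand sides.
R = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
q = np.zeros(2)
res_mat = mod_g.fit_constrained((R, q))

params_str = np.asarray(res_str.params)
params_mat = np.asarray(res_mat.params)
print(params_str)
print(params_mat)
```

The matrix form can be convenient when there are many levels and writing out the coefficient names by hand becomes error-prone.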

I'm not sure how patsy formulas can be written so that none of the levels is dropped when there are several categorical explanatory variables.