How do I add a sum-to-zero constraint to a GLM in Python?
I have a model set up in Python using the statsmodels glm
function, but now I want to add a sum-to-zero constraint to the model.
The model is defined as follows:
import statsmodels.api as sm
import statsmodels.formula.api as smf
model = smf.glm(formula="A ~ B + C + D", data=data, family=sm.families.Poisson()).fit()
In R, to add the constraint, I would simply do something like this:
model <- glm(A ~ B + C + D - 1, family=poisson(), data=data, contrasts=list(C="contr.sum", D="contr.sum"))
That adds the sum-to-zero constraint to both C and D, but I am not sure how to achieve the same in Python.
I have seen that there is a fit_constrained()
method available, but I am not too sure how to use it, or whether it is even the right tool for what I require.
Can anyone offer any advice on applying this constraint?
Here is an example to illustrate fit_constrained,
using the Gaussian family since I didn't quickly find a Poisson example with categorical variables.
import pandas
import statsmodels.api as sm
from statsmodels.formula.api import glm
url = 'http://www.ats.ucla.edu/stat/data/hsb2.csv'
hsb2 = pandas.read_table(url, delimiter=",")
mod = glm("write ~ C(race) - 1", data=hsb2)
res = mod.fit()
print(res.summary())
Constrain all coefficients to sum to zero:
res_c = mod.fit_constrained('C(race)[1] + C(race)[2] + C(race)[3] + C(race)[4] = 0')
print(res_c.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: write No. Observations: 200
Model: GLM Df Residuals: 197
Model Family: Gaussian Df Model: 2
Link Function: identity Scale: 1232.08314649
Method: IRLS Log-Likelihood: -993.41
Date: Wed, 25 Mar 2015 Deviance: 2.4149e+05
Time: 16:42:37 Pearson chi2: 2.41e+05
No. Iterations: 1
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
C(race)[1] 1.0002 221.565 0.005 0.996 -433.260 435.260
C(race)[2] -41.1814 267.253 -0.154 0.878 -564.988 482.626
C(race)[3] -6.3498 235.771 -0.027 0.979 -468.453 455.754
C(race)[4] 46.5311 100.184 0.464 0.642 -149.827 242.889
==============================================================================
Model has been estimated subject to linear equality constraints.
Constraints are comma-separated and default to equal zero:
res_c2 = mod.fit_constrained('C(race)[1] + C(race)[2], C(race)[3] + C(race)[4]')
print(res_c2.summary())
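If I understand the docs correctly, fit_constrained should also accept the constraint as a pair of arrays (R, q) with R @ params = q, which avoids spelling out the column names. A sketch on a small synthetic stand-in for the hsb2 data (useful in case the URL above is unavailable):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import glm

# Synthetic stand-in for hsb2: same variable names, made-up values
rng = np.random.default_rng(1)
df = pd.DataFrame({"race": rng.choice([1, 2, 3, 4], size=120)})
df["write"] = 50 + rng.normal(scale=10, size=120)

mod = glm("write ~ C(race) - 1", data=df)

# String form, as above:
res_str = mod.fit_constrained(
    "C(race)[1] + C(race)[2] + C(race)[3] + C(race)[4] = 0")

# Matrix form: constraints given as (R, q) with R @ params = q;
# a single row of ones forces the four coefficients to sum to zero.
R = np.ones((1, 4))
q = np.zeros(1)
res_mat = mod.fit_constrained((R, q))

print(res_str.params.sum())  # effectively zero
```

Both forms should produce the same constrained estimates.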
which prints
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: write No. Observations: 200
Model: GLM Df Residuals: 198
Model Family: Gaussian Df Model: 1
Link Function: identity Scale: 1438.99574167
Method: IRLS Log-Likelihood: -1008.9
Date: Wed, 25 Mar 2015 Deviance: 2.8204e+05
Time: 16:42:37 Pearson chi2: 2.82e+05
No. Iterations: 1
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
C(race)[1] 13.6286 242.003 0.056 0.955 -460.689 487.946
C(race)[2] -13.6286 242.003 -0.056 0.955 -487.946 460.689
C(race)[3] -41.6606 111.458 -0.374 0.709 -260.115 176.794
C(race)[4] 41.6606 111.458 0.374 0.709 -176.794 260.115
==============================================================================
Model has been estimated subject to linear equality constraints.
I'm not sure how to write a patsy formula so that none of the levels is dropped when there are several categorical explanatory variables.
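One way to inspect what patsy actually does is to build the design matrix directly. In this sketch (toy frame, hypothetical column names), removing the intercept lets the first categorical keep all its levels, but patsy still reduces the coding of the second one to keep the matrix full rank, which is presumably why fit_constrained is needed rather than formula tricks alone:

```python
import pandas as pd
from patsy import dmatrix

# Toy frame with two categorical columns (hypothetical names)
df = pd.DataFrame({"a": ["a1", "a2", "a3", "a1", "a2", "a3"],
                   "b": ["b1", "b2", "b1", "b2", "b1", "b2"]})

# With the intercept removed, C(a) keeps all three of its levels,
# but C(b) is still reduced (one level dropped) to stay full rank.
m = dmatrix("0 + C(a) + C(b)", df)
print(m.design_info.column_names)
```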