Using the ols function with parameters that contain numbers/spaces
I am having a lot of difficulty using the statsmodels.formula.api function
ols(formula,data).fit().rsquared_adj
due to the nature of the names of my predictors. The predictors have numbers, spaces, etc. in them, which it clearly doesn't like. I understand that I need to use something like patsy.builtins.Q. So let's say my predictor is weight.in.kg; it should be entered as follows:
Q("weight.in.kg")
So I need to build my formula from a list, and the difficulty arises in wrapping every item in the list with this patsy.builtins.Q:
formula = "{} ~ {} + 1".format(response, ' + '.join([candidate]))
with [candidate] being my list of predictors.
My question to you, dearest python experts, is how on earth do I put every individual item in the list [candidate] within the quotes in the following expression:
Q('')
so that the ols function can actually read it? Apologies if this is super obvious, me no good at python.
Right now you're starting with a list of terms that you want in your formula, then trying to paste them together into a complicated string, which patsy will parse and convert back into a list of terms. You can see the data structure that patsy generates for this kind of formula (ModelDesc.from_formula is patsy's parser):
In [7]: from patsy import ModelDesc
In [8]: ModelDesc.from_formula("y ~ x1 + x2 + x3")
Out[8]:
ModelDesc(lhs_termlist=[Term([EvalFactor('y')])],
rhs_termlist=[Term([]),
Term([EvalFactor('x1')]),
Term([EvalFactor('x2')]),
Term([EvalFactor('x3')])])
This might look a little intimidating, but it's pretty simple really -- you have a ModelDesc, which represents a single formula, and it has a left-hand-side list of terms and a right-hand-side list of terms. Each term is represented by a Term object, and each Term has a list of factors. (Here each term just has a single factor -- if you had any interactions then those terms would have multiple factors.) Also, the "empty interaction" Term([]) is how patsy represents the intercept term.
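To make the structure described above concrete, here is a small sketch that parses a formula containing one interaction and counts the factors in each right-hand-side term. The formula and the variable names (y, x1, x2) are made up for illustration; only the parsed structure is inspected, so no data is needed.

```python
from patsy import ModelDesc, Term

# Hypothetical formula: one main effect plus one interaction
desc = ModelDesc.from_formula("y ~ x1 + x1:x2")

# Number of factors in each right-hand-side term:
# the intercept has none, x1 has one, x1:x2 has two
rhs_factor_counts = [len(term.factors) for term in desc.rhs_termlist]

# The intercept is represented by the "empty interaction" Term([])
has_intercept = Term([]) in desc.rhs_termlist

print(rhs_factor_counts, has_intercept)
```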
So you can avoid all this complicated quoting/parsing stuff by directly creating the terms you want and passing them to patsy, skipping the string parsing step:
from patsy import ModelDesc, Term, LookupFactor
response_terms = [Term([LookupFactor(response)])]
# start with intercept...
model_terms = [Term([])]
# ...then add another term for each candidate
model_terms += [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)
Now you can pass that model_desc object into any function where you'd normally pass a patsy formula:
ols(model_desc, data).fit().rsquared_adj
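As a rough end-to-end check, the sketch below builds the same kind of model_desc for some made-up data whose column names contain dots, spaces and digits, then hands it to patsy's dmatrices instead of ols so that only patsy is needed. The data values and the column names ("body fat pct", "height 2 cm") are assumptions for illustration.

```python
from patsy import ModelDesc, Term, LookupFactor, dmatrices

# Hypothetical data; none of these names is a valid Python identifier
data = {
    "body fat pct": [12.0, 18.5, 25.1, 30.2],
    "weight.in.kg": [60.0, 72.5, 88.0, 95.3],
    "height 2 cm": [170.0, 165.0, 180.0, 175.0],
}
response = "body fat pct"
candidates = ["weight.in.kg", "height 2 cm"]

# Left-hand side: the response; right-hand side: intercept plus one term per candidate
response_terms = [Term([LookupFactor(response)])]
model_terms = [Term([])] + [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)

# Build the response vector and design matrix directly from the ModelDesc
y, X = dmatrices(model_desc, data)
column_names = X.design_info.column_names
print(column_names)
```

The names are looked up as-is, so no Q(...) wrapping or quoting is involved at any point.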
There's another trick here: you'll notice that the first example has EvalFactor objects, and now we're using LookupFactor objects instead. The difference is that EvalFactor takes a string of arbitrary Python code, which is nice if you want to write something like np.log(x1), but really annoying if you have variables with names like weight.in.kg. LookupFactor directly takes the name of a variable to look up in your data, so no further quoting is needed.
Alternatively, you could do this with some fancier Python string processing, like:
quoted = ["Q('{}')".format(c) for c in candidates]
formula = "{} ~ {} + 1".format(response, ' + '.join(quoted))
But while this is a bit simpler to start with, it's much more fragile -- for example, think about (or try) what happens if one of your parameters contains a quote character! You should never write something like this in a processing pipeline where the candidate names come from somewhere else that you can't control (e.g. a random CSV file) -- you could get all kinds of arbitrary code execution. The solution above avoids all of these problems.
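To see the fragility without running anything dangerous, here is a small sketch that applies the same Q('...') quoting to three made-up names and uses Python's ast module to check whether the result is still a single Q call on a string literal. The helper names (naive_quote, parses_as_single_q_call) and the example column names are hypothetical; nothing is evaluated, only parsed.

```python
import ast

def naive_quote(name):
    # Same quoting as the string-based approach: wrap the raw name in Q('...')
    return "Q('{}')".format(name)

def parses_as_single_q_call(expr):
    """Return True if expr is valid Python made of exactly one call Q(<string literal>)."""
    try:
        node = ast.parse(expr, mode="eval").body
    except SyntaxError:
        return False
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "Q"
        and len(node.args) == 1
        and not node.keywords
        and isinstance(node.args[0], ast.Constant)
        and isinstance(node.args[0].value, str)
    )

harmless = naive_quote("weight.in.kg")    # a single Q call, as intended
broken = naive_quote("bob's weight")      # the stray quote makes this invalid Python
hostile = naive_quote("x') + len('abc")   # closes the string early and adds extra code

for expr in (harmless, broken, hostile):
    print(expr, parses_as_single_q_call(expr))
```

Only the first name survives the quoting intact; the third one shows how a crafted column name turns into additional code inside the formula.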