Using the ols function with parameters that contain numbers/spaces

Question:

I am having a lot of difficulty using the statsmodels.formula.api function

       ols(formula,data).fit().rsquared_adj

because of the names of my predictors. They contain numbers, spaces, and other characters that the formula parser clearly doesn't like. I understand that I need to use something like patsy.builtins.Q. So if my predictor were weight.in.kg, it should be entered as follows:

Q("weight.in.kg")

So I need to build my formula from a list, and the difficulty is in wrapping every item in the list with patsy.builtins.Q:

formula = "{} ~ {} + 1".format(response, ' + '.join([candidate]))

where [candidate] is my list of predictors.

My question to you, dearest Python experts, is: how on earth do I put every individual item in the list [candidate] within the quotes in the following expression:

Q('')

so that the ols function can actually read it? Apologies if this is super obvious; I'm not good at Python.

Answer:

Right now you're starting with a list of terms that you want in your formula, then pasting them together into a complicated string, which patsy will parse and convert back into a list of terms. You can see the data structure that patsy generates for this kind of formula (ModelDesc.from_formula is patsy's parser):

In [7]: from patsy import ModelDesc

In [8]: ModelDesc.from_formula("y ~ x1 + x2 + x3")
Out[8]: 
ModelDesc(lhs_termlist=[Term([EvalFactor('y')])],
          rhs_termlist=[Term([]),
                        Term([EvalFactor('x1')]),
                        Term([EvalFactor('x2')]),
                        Term([EvalFactor('x3')])])

This might look a little intimidating, but it's pretty simple really: you have a ModelDesc, which represents a single formula, and it has a left-hand-side list of terms and a right-hand-side list of terms. Each term is represented by a Term object, and each Term has a list of factors. (Here each term has just a single factor; if you had any interactions, those terms would have multiple factors.) Also, the "empty interaction" Term([]) is how patsy represents the intercept term.
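To make the term/factor structure concrete, here is a small sketch that parses a formula containing an interaction and counts the factors in each right-hand-side term. The formula and the variable names x1 and x2 are made up for illustration.

```python
from patsy import ModelDesc

# Hypothetical formula: a main effect plus an interaction between x1 and x2.
desc = ModelDesc.from_formula("y ~ x1 + x1:x2")

# Each Term holds a tuple of factors: the intercept term has none,
# a main effect has one, and an interaction has one per variable involved.
factor_counts = [len(term.factors) for term in desc.rhs_termlist]

for term, count in zip(desc.rhs_termlist, factor_counts):
    print(term.name(), count)
```

The interaction x1:x2 is still a single Term; it simply carries two factors instead of one.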

So you can avoid all this complicated quoting and parsing by creating the terms you want directly and passing them to patsy, skipping the string-parsing step:

from patsy import ModelDesc, Term, LookupFactor

response_terms = [Term([LookupFactor(response)])]
# start with intercept...
model_terms = [Term([])]
# ...then add another term for each candidate
model_terms += [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)

Now you can pass that model_desc object into any function where you'd normally pass a patsy formula:

ols(model_desc, data).fit().rsquared_adj

There's another trick here: you'll notice that the first example has EvalFactor objects, and now we're using LookupFactor objects instead. The difference is that EvalFactor takes a string of arbitrary Python code, which is nice if you want to write something like np.log(x1), but really annoying if you have variables with names like weight.in.kg. LookupFactor directly takes the name of a variable to look up in your data, so no further quoting is needed.
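As a rough end-to-end sketch of the LookupFactor approach, the snippet below builds design matrices with patsy.dmatrices from a small made-up dataset. The column names and values are invented for illustration, and dmatrices is called directly here so the example does not depend on statsmodels.

```python
import numpy as np
from patsy import ModelDesc, Term, LookupFactor, dmatrices

# Made-up data whose column names are not valid Python identifiers.
data = {
    "y": np.array([1.0, 2.0, 3.0, 4.0]),
    "weight.in.kg": np.array([60.0, 72.5, 80.0, 91.0]),
    "height 2 cm": np.array([160.0, 171.0, 180.0, 175.0]),
}
response = "y"
candidates = ["weight.in.kg", "height 2 cm"]

# Left-hand side: a single term that looks up the response column by name.
response_terms = [Term([LookupFactor(response)])]
# Right-hand side: intercept first, then one term per candidate column.
model_terms = [Term([])] + [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)

# The names are used as plain lookup keys, so no Q(...) quoting is involved.
y, X = dmatrices(model_desc, data)
print(X.design_info.column_names)
```

The names are never parsed as code, so dots, digits, and spaces in a column name need no special handling.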

Alternatively, you could do this with some fancier Python string processing, like:

quoted = ["Q('{}')".format(c) for c in candidates]
formula = "{} ~ {} + 1".format(response, ' + '.join(quoted))

But while this is a bit simpler to start with, it's much more fragile. For example, think about (or try) what happens if one of your parameters contains a quote character! You should never write something like this in a processing pipeline where the candidate names come from somewhere you can't control (e.g. a random CSV file), because it opens the door to arbitrary code execution. The LookupFactor solution above avoids all of these problems.
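The fragility can be seen without running patsy at all. The sketch below uses only the standard library to inspect the strings that the naive quoting produces; the helper names naive_quote and is_valid_python and the example column names are hypothetical.

```python
import ast

def naive_quote(name):
    # Same quoting as in the string-processing approach above.
    return "Q('{}')".format(name)

def is_valid_python(expr):
    # True if expr parses as a single Python expression.
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False

ok = naive_quote("weight.in.kg")
broken = naive_quote("patient's weight")
crafted = naive_quote("x') + print('arbitrary code runs here")

print(ok, is_valid_python(ok))
print(broken, is_valid_python(broken))
print(crafted, is_valid_python(crafted))

# Names of the functions that the crafted expression would call.
called = {
    node.func.id
    for node in ast.walk(ast.parse(crafted, mode="eval"))
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
}
print(called)
```

A name with an apostrophe yields an expression that does not parse, while the crafted name closes the quote early and smuggles a second function call into the formula, outside the Q(...) argument.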

References:

https://patsy.readthedocs.io/en/latest/expert-model-specification.html
https://patsy.readthedocs.io/en/latest/formulas.html
