Adding a density line to a histogram with count data in ggplot2

Problem description:

I want to add a density line (a normal density, actually) to a histogram.

Suppose I have the following data. I can plot the histogram with ggplot2:

library(ggplot2)

set.seed(123)
df <- data.frame(x = rbeta(10000, shape1 = 2, shape2 = 4))

ggplot(df, aes(x = x)) + 
  geom_histogram(colour = "black", fill = "white", binwidth = 0.01)

I can add a density line using:

ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y = ..density..),colour = "black", fill = "white", 
                 binwidth = 0.01) + 
  stat_function(fun = dnorm, args = list(mean = mean(df$x), sd = sd(df$x)))

But this is not what I actually want: I want the density line to be fitted to the count data.

I found a similar post (HERE) that offered a solution to this problem, but it did not work in my case. I need an arbitrary expansion factor to get what I want, and this is not generalizable at all:

ef <- 100 # Expansion factor

ggplot(df, aes(x = x)) + 
  geom_histogram(colour = "black", fill = "white", binwidth = 0.01) + 
  stat_function(fun = function(x, mean, sd, n){ 
    n * dnorm(x = x, mean = mean, sd = sd)}, 
    args = list(mean = mean(df$x), sd = sd(df$x), n = ef))

Any clues that I can use to generalize this

  • first to the normal distribution,
  • then to any other bin size,
  • and lastly to any other distribution

would be very helpful.

Fitting a distribution function does not happen by magic. You have to do it explicitly. One way is to use fitdistr(...) from the MASS package.

library(MASS)    # for fitdistr(...)
# excellent fit (of course...)
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=dbeta,args=fitdistr(df$x,"beta",start=list(shape1=1,shape2=1))$estimate)

# horrible fit - no surprise here
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=dnorm,args=fitdistr(df$x,"normal")$estimate)

# mediocre fit - also not surprising...
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=dgamma,args=fitdistr(df$x,"gamma")$estimate)
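
The verdicts in the comments above are visual. As a rough numerical cross-check, the three fits can be compared by AIC (lower is better). This is only a sketch: it relies on the logLik method that MASS provides for fitdistr objects, and the names fits and aic are introduced here for illustration.

```r
library(MASS)  # fitdistr()

# Same simulated data as in the question
set.seed(123)
x <- rbeta(10000, shape1 = 2, shape2 = 4)

# Fit the three candidate distributions by maximum likelihood
fits <- list(
  beta   = fitdistr(x, "beta", start = list(shape1 = 1, shape2 = 1)),
  normal = fitdistr(x, "normal"),
  gamma  = fitdistr(x, "gamma")
)

# AIC() works on fitdistr objects through their logLik method
aic <- sapply(fits, AIC)
aic[order(aic)]  # candidates sorted from best to worst
```

Since the data were simulated from a beta distribution, the beta fit should come out with the lowest AIC.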

EDIT: Response to OP's comment.

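The scale factor is binwidth × sample size. A density integrates to 1, whereas the bars of a count histogram have total area binwidth × sample size. For the data in the question that is 0.01 × 10000 = 100, which is why an expansion factor of 100 happened to work.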

ggplot(df, aes(x = x)) + 
  geom_histogram(colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=function(x,shape1,shape2)0.01*nrow(df)*dbeta(x,shape1,shape2),
                args=fitdistr(df$x,"beta",start=list(shape1=1,shape2=1))$estimate)
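
To generalize this to other bin widths and other distributions, the scaling can be wrapped in a small helper. The following is a minimal sketch, not part of ggplot2 or MASS: scale_to_counts is a hypothetical name, and the base R hist() call is only there to check the identity count = density × n × binwidth on the same data.

```r
# Hypothetical helper (not from ggplot2 or MASS): wraps any density
# function so that it returns expected counts per bin instead of density.
scale_to_counts <- function(dfun, n, binwidth) {
  function(x, ...) n * binwidth * dfun(x, ...)
}

set.seed(123)
x <- rbeta(10000, shape1 = 2, shape2 = 4)
binwidth <- 0.01

# Check with base R: for equal-width bins, hist() reports
# density = counts / (n * binwidth), so counts = density * n * binwidth.
h <- hist(x, breaks = seq(0, 1, by = binwidth), plot = FALSE)
stopifnot(isTRUE(all.equal(h$counts, h$density * length(x) * binwidth)))

# A normal curve on the count scale, for any bin width
norm_counts <- scale_to_counts(dnorm, n = length(x), binwidth = binwidth)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(x = x), aes(x = x)) +
    geom_histogram(colour = "black", fill = "white", binwidth = binwidth) +
    stat_function(fun = norm_counts,
                  args = list(mean = mean(x), sd = sd(x)))
  print(p)
}
```

Swapping dnorm for dbeta or dgamma, with args taken from the matching fitdistr(...)$estimate, should give the count-scaled curve for those distributions, and changing binwidth in one place keeps the histogram and the curve in step.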