Preventing search abuse
I haven't been able to google anything useful on this subject, so I'd appreciate either links to articles that deal with it or direct answers here.
I am implementing a search system in PHP/MySQL on a site that gets quite a lot of visitors, so I am going to impose some restrictions: a minimum length for the search term a visitor is allowed to enter, and a minimum time required between two searches. Since I'm fairly new to these problems and don't really know the "real reasons" why this is usually done, my assumption is that the minimum character length is there to limit the number of results the database will return, and that the minimum time between searches is there to prevent robots from spamming the search system and slowing down the site. Is that right?
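The length restriction itself I'd enforce before the query ever reaches MySQL, roughly like this (the 3-character minimum and the `q` parameter name are just examples I picked):

```php
<?php
// Reject too-short search terms before querying the database.
$query = isset($_GET['q']) ? trim($_GET['q']) : '';

if (mb_strlen($query) < 3) {
    exit('Please enter at least 3 characters.');
}

// ... run the actual (parameterised) search query here ...
```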
And finally, the question of how to implement the minimum time between two searches. The solution I came up with, in pseudo-code, is this (a rough PHP sketch follows the list):
- Set a test cookie at the URL where the search form is submitted to
- Redirect user to the URL where the search results should be output
- Check if the test cookie exists
- If not, output a warning that he isn't allowed to use the search system (he is probably a robot)
- Check if a cookie exists that tells the time of the last search
- If this was less than 5 seconds ago, output a warning that he should wait before searching again
- Search
- Set a cookie with the time of last search to current time
- Output search results
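In PHP I imagine that flow would look roughly like this; the file names (submit.php, search.php), the cookie names, the one-hour cookie lifetime and the 5-second window are just placeholders for illustration:

```php
<?php
// submit.php - the URL the search form is submitted to:
// set the test cookie, then redirect to the results page.
setcookie('search_test', '1', time() + 3600, '/');
header('Location: /search.php?q=' . urlencode($_POST['q']));
exit;
```

```php
<?php
// search.php - the URL where the search results are output.
if (!isset($_COOKIE['search_test'])) {
    exit('You are not allowed to use the search system.'); // probably a robot
}

$now  = time();
$last = isset($_COOKIE['last_search']) ? (int) $_COOKIE['last_search'] : 0;

if ($now - $last < 5) {
    exit('Please wait a few seconds before searching again.');
}

// Remember the time of this search, then run it and output the results.
setcookie('last_search', (string) $now, $now + 3600, '/');

// ... search and output results ...
```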
Is this the best way to do it?
I understand this means visitors who have cookies disabled will not be able to use the search system, but is that really a problem these days? I couldn't find statistics for 2012, but I did find data saying 3.7% of people had cookies disabled in 2009. That doesn't seem like a lot, and I suppose it's probably even lower these days.
"only my assumptions that the character minimum length is implemented to minimize the number of results the database will return". Your assumption is absolutely correct. It reduces the number of potential results, by forcing the user to think about, what it is they wish to search.
As far as bots spamming your search goes, you could implement a captcha; the most frequently used is reCAPTCHA. If you don't want to show a captcha right away, you can track (via session) the number of times the user has submitted a search, and if X searches occur within a certain time frame, render the captcha.
I've seen sites like SO and thechive.com implement this type of strategy, where the captcha isn't rendered right away but will be rendered once a threshold is exceeded.
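A minimal sketch of that session-based threshold, assuming a 10-searches-per-minute limit and a hypothetical showCaptcha() helper (the actual reCAPTCHA rendering is left out):

```php
<?php
// Count searches per session and require a captcha once a threshold
// is exceeded. The window, threshold and showCaptcha() are placeholders.
session_start();

$window    = 60;  // seconds
$threshold = 10;  // searches allowed per window

if (!isset($_SESSION['search_log'])) {
    $_SESSION['search_log'] = array();
}

// Keep only the timestamps that fall inside the current window.
$now = time();
$_SESSION['search_log'] = array_filter(
    $_SESSION['search_log'],
    function ($t) use ($now, $window) { return $t > $now - $window; }
);

$_SESSION['search_log'][] = $now;

if (count($_SESSION['search_log']) > $threshold) {
    showCaptcha(); // placeholder: render the captcha instead of the results
    exit;
}

// ... otherwise run the search as usual ...
```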
This way you're also preventing search engines from indexing your search results. A cleaner way of doing this would be (sketched after the list):
- Get the IP address the search originated from
- Store that IP, along with the time the query was made, in a cache system such as memcached
- If another query comes in from the same IP and less than X seconds have passed, simply reject it or make the user wait
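A sketch of that, assuming the Memcached PHP extension, a 5-second window and an arbitrary key prefix:

```php
<?php
// IP-based rate limiting backed by memcached.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$ip  = $_SERVER['REMOTE_ADDR'];
$key = 'last_search_' . $ip;

$last = $memcached->get($key);

if ($last !== false && time() - (int) $last < 5) {
    http_response_code(429);
    exit('Please wait a few seconds before searching again.');
}

// Record the time of this search; let the entry expire after 60 seconds.
$memcached->set($key, time(), 60);

// ... run the search ...
```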
Another thing you can do to increase performance is to look at your analytics, see which queries are made most often, and cache those, so that when such a request comes in you serve the cached version instead of doing the full database query, parsing, etc.
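For example, something along these lines, where runSearchQuery() stands in for whatever the real database search is and the 10-minute TTL is arbitrary:

```php
<?php
// Serve popular queries from memcached instead of hitting MySQL every time.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$query = trim($_GET['q']);
$key   = 'search_results_' . md5(mb_strtolower($query));

$results = $memcached->get($key);

if ($results === false) {
    // Cache miss: run the real database query, then cache it for 10 minutes.
    $results = runSearchQuery($query); // placeholder for the DB search
    $memcached->set($key, $results, 600);
}

// ... render $results ...
```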
Another, more naive option would be to have a script run once or twice a day that executes all the common queries and generates static HTML files, so that users making those particular searches are served the files instead of hitting the database.
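A rough sketch of such a script, where getCommonQueries() and renderResultsPage() are placeholders for pulling the popular queries (e.g. from analytics) and building the results page:

```php
<?php
// Cron job: pre-render the most common queries to static HTML files.
$outputDir = __DIR__ . '/static-search';

foreach (getCommonQueries() as $query) {    // placeholder: popular queries
    $html = renderResultsPage($query);      // placeholder: runs the DB query, builds HTML
    $file = $outputDir . '/' . md5($query) . '.html';
    file_put_contents($file, $html);
}
```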