运用标签云扩展自己的应用

使用标签云扩展自己的应用
标签云或文字云是关键词的视觉化描述，用于汇总用户生成的标签或一个网站的文字内容。标签一般是独立的词汇，常常按字母顺序排列，其重要程度又能通过改变字体大小或颜色来表现。所以标签云可以灵活地依照字序或热门程度来检索一个标签。大多数标签本身就是超级链接，直接指向与标签相联的一系列条目。

互联网标签云（Tag Cloud）的概念最早由Stewart Butterfield在《Make a Flickr-Style Tag》一文中提出。在那些用户分享频繁的web2.0网站，比如Flickr、Del.icio.us和Technorati中得到了广泛的应用。

简要总结了一下，标签云的作用主要有以下三类：

热门推荐
自动分类
数据分析

比如下图展示了与Web2.0相关的一组标签云，字体的大小参照相关程度：
运用标签云扩展自己的应用

通过上图我们可以看出，关注Web2.0的人们，可能还会对博客、用户参与、长尾理论等关键词感兴趣。

下面的TagCloud可视化地描绘了全球的人口分布，字体的大小参照各国的人口数量：
运用标签云扩展自己的应用

通过上图，我们可以清晰地看出，中国和印度是世界上人口最多的两个国家；排在其后的人口大国有美国、尼日利亚、巴基斯坦等等。

标签云也可以用作数据分析，比如下面这张有趣的图，分别摘自布什总统在2002年以及奥巴马总统在2011年的政府报告。这里，字体的大小与关键词的TF（词频）以及IDF（反文档频率）相关，即某个关键词出现的频率越高，而在其它文档中出现的频率越低，就认为该词的权重越大：
运用标签云扩展自己的应用

从两份报告总可以看出，911发生之后，布什更强调安全、反恐、武器等，而在经济危机来临之时，奥巴马则更强调工作、税收、商务等。

关于实现一个TagCloud的代码架构，可以参考下图：
运用标签云扩展自己的应用

其中各类的声明如下所述：

name	description
Tag	标签实体类，考虑到英语语法的特性，包含词语和词根两个属性
TagCache	缓存所有Tag
TagMagnitude	Tag的模
TagMagnitudeVector	Tag矢量的模，一个TagMagnitudeVector包含多个TagMagnitude
TagMagnitudeVector	Tag矢量的模，一个TagMagnitudeVector包含多个TagMagnitude
InverseDocFreqEstimator	IDF的计算
HTMLTagCloudDecorator	实现动态字体大小设置的HTMl装饰类

核心代码如下：

public class LuceneTextAnalyzer implements TextAnalyzer {
    
    public List<Tag> analyzeText(String text) throws IOException {
        Reader reader = new StringReader(text);
        Analyzer analyzer = getAnalyzer();//获取继承自Lucene的Analyzer
        List<Tag> tags = new ArrayList<Tag>();
        TokenStream tokenStream = analyzer.tokenStream(null, reader) ;
        Token token = tokenStream.next();
        while ( token != null) {
            Tag tag = getTag(token.termText());
            tags.add(tag);
            token = tokenStream.next();
        }
        return tags;
    }
    
    public TagMagnitudeVector createTagMagnitudeVector(String text) throws IOException {
        List<Tag> tagList =  analyzeText(text);
        Map<Tag,Integer> tagFreqMap = computeTermFrequency(tagList);//计算TF
        return applyIDF(tagFreqMap);//计算IDF
    }
    
  
    private Map<Tag,Integer> computeTermFrequency(List<Tag> tagList) {
        Map<Tag,Integer> tagFreqMap = new HashMap<Tag,Integer>();
        for (Tag tag: tagList) {
            Integer count = tagFreqMap.get(tag);
            if (count == null) {
                count = new Integer(1);
            } else {
                count = new Integer(count.intValue() + 1);
            }
            tagFreqMap.put(tag, count);
        }
        return tagFreqMap;
    }
    
    private TagMagnitudeVector applyIDF(Map<Tag,Integer> tagFreqMap) {
        List<TagMagnitude> tagMagnitudes = new ArrayList<TagMagnitude>();
        for (Tag tag: tagFreqMap.keySet()) {
            double idf = this.inverseDocFreqEstimator.estimateInverseDocFreq(tag);
            double tf = tagFreqMap.get(tag);
            double wt = tf*idf;
            tagMagnitudes.add(new TagMagnitudeImpl(tag,wt));
        }
        return new TagMagnitudeVectorImpl(tagMagnitudes);
    }
    
    private TagCloud createTagCloud(TagMagnitudeVector tmVector) {
        List<TagCloudElement> elements = new ArrayList<TagCloudElement>();
        for (TagMagnitude tm: tmVector.getTagMagnitudes()) {
            TagCloudElement element = new TagCloudElementImpl(tm.getDisplayText(), tm.getMagnitude());
            elements.add(element);
        }
        return new TagCloudImpl(elements, new LinearFontSizeComputationStrategy(3,"font-size: "));//使用线性的字体计算策略，权重加一级，字体增大一号
    }
    
    private String visualizeTagCloud(TagCloud tagCloud) {
        HTMLTagCloudDecorator decorator = new HTMLTagCloudDecorator();//使用HTML包装内容
        String html = decorator.decorateTagCloud(tagCloud);
        System.out.println(html);
        return html;
    }

    public static void main(String [] args) throws IOException {
        String title = "Collective Intelligence and Web2.0";
        String body = "Web2.0 is all about connecting users to users, "  +
                " inviting users to participate and applying their collective" +
                " intelligence to improve the application. Collective intelligence" +
                " enhances the user experience" ;
                
        TagCacheImpl t = new TagCacheImpl();
        InverseDocFreqEstimator idfEstimator = new EqualInverseDocFreqEstimator();
        LuceneTextAnalyzer lta = new LuceneTextAnalyzer(t, idfEstimator);
//        System.out.print("Analyzing the title .... \n");
//        lta.displayTextAnalysis(title);
//        System.out.print("\nAnalyzing the body .... \n");
//        lta.displayTextAnalysis(body);
        TagMagnitudeVector tmTitle = lta.createTagMagnitudeVector(title);
        TagMagnitudeVector tmBody = lta.createTagMagnitudeVector(body);
        TagMagnitudeVector tmCombined = tmTitle.add(tmBody);
        System.out.println(tmCombined);    
        TagCloud tagCloud = lta.createTagCloud(tmBody);
        String html = lta.visualizeTagCloud(tagCloud);
        System.out.println("html:" + html);
    }
    
    private void displayTextAnalysis(String text) throws IOException {
        List<Tag> tags = analyzeText(text);
        for (Tag tag: tags) {
            System.out.print(tag + " ");
        }      
    }
}

运用标签云扩展自己的应用

相关推荐