Importing Word into FCKeditor (Java implementation)
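Our project has recently reached what could be called a milestone, so I am using this post to sum up the techniques used so far.
The project is for a government agency, and the back end is essentially a CMS. The client insisted on one feature: importing a Word document directly into the web editor. We use FCKeditor 2.5. This feature gave me a real headache, and for several days I had no idea how to approach it, until I came across a post online at the following address:
http://topic.****.net/u/20091020/21/b77f825b-4a18-4a86-b642-8d38ffef9e12.html
The poster in the 3rd reply pasted his code, and the idea is a good one.
First call a COM component to convert the Word file to HTML, then cut out the important parts of the source, and finally put that code into the FCK editor. I ran into quite a few technical details along the way. Here is my implementation.
1. Use jacob to convert Word to HTML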
Java code

import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

/**
 * Convert a Word file to an HTML file.
 *
 * @param src
 *            source file
 * @param out
 *            target file
 */
public static synchronized void word2Html(String src, String out) {
    ActiveXComponent app = null;
    try {
        app = new ActiveXComponent("Word.Application"); // start Word
        app.setProperty("Visible", new Variant(false)); // keep the Word window hidden
        Dispatch docs = app.getProperty("Documents").toDispatch();
        // Open the Word file (no conversion prompt, read-only)
        Dispatch doc = Dispatch.invoke(docs, "Open", Dispatch.Method,
                new Object[] { src, new Variant(false), new Variant(true) },
                new int[1]).toDispatch();
        // Save format: 8 = HTML, 9 = MHT
        Dispatch.invoke(doc, "SaveAs", Dispatch.Method,
                new Object[] { out, new Variant(8) }, new int[1]);
        Variant f = new Variant(false);
        Dispatch.call(doc, "Close", f);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Always quit Word here, otherwise winword.exe processes pile up on the server.
        // The null check avoids a NullPointerException when Word failed to start.
        if (app != null) {
            app.invoke("Quit", new Variant[] {});
        }
    }
}

What the code above does is simply call a COM component to start Word, hide its window, and save the opened Word file as HTML.
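The bare 8 and 9 passed to SaveAs are easy to mix up, so one option is to give them names. The sketch below is only an illustration: the enum WordSaveFormat and its extension() helper are hypothetical names, not part of jacob or of the original code. The values 8 and 9 are the ones used in the post; 10 comes from Word's WdSaveFormat enumeration (wdFormatFilteredHTML) and has not been tried in this pipeline.

```java
/** Named values for the numeric format argument passed to Word's SaveAs (hypothetical helper). */
enum WordSaveFormat {
    HTML(8),            // wdFormatHTML: the format used in this post
    WEB_ARCHIVE(9),     // wdFormatWebArchive: single-file MHT
    FILTERED_HTML(10);  // wdFormatFilteredHTML: not used in the post, listed for reference

    private final int code;

    WordSaveFormat(int code) {
        this.code = code;
    }

    /** The integer that would be wrapped in a Variant for SaveAs. */
    int code() {
        return code;
    }

    /** File extension that matches the format (an assumption made for this sketch). */
    String extension() {
        return this == WEB_ARCHIVE ? ".mht" : ".html";
    }
}
```

With this in place, the SaveAs call could pass new Variant(WordSaveFormat.HTML.code()) instead of new Variant(8).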
2. Read the file with Apache Commons IO

Java code

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.commons.io.IOUtils;

/**
 * Read the HTML code from the given file.
 *
 * @param fileName
 *            path of the HTML file
 * @return the file content, or null if reading failed
 */
public static synchronized String getHtmlCode(String fileName) {
    InputStream in = null;
    String result = null;
    try {
        in = new FileInputStream(fileName);
        result = IOUtils.toString(in, "gb2312");
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeQuietly(in);
    }
    return result;
}

The HTML file produced by the conversion is gb2312-encoded by default. Note that the string you read must keep its whitespace: if you copy the string into a text file, it should look exactly like the HTML source, line breaks included.

3. Cut out the body code

Java code

/**
 * Cut out the body content.
 *
 * @param htmlCode
 *            the complete HTML produced by Word
 * @return everything from the opening body tag up to (not including) the closing html tag,
 *         or an empty string if either marker is missing
 */
public static synchronized String performBodyCode(String htmlCode) {
    String bodyCode = "";
    // Locate the body
    int bodyIndex = htmlCode.indexOf("<body");
    int bodyEndIndex = htmlCode.indexOf("</html>");
    if (bodyIndex != -1 && bodyEndIndex != -1) {
        htmlCode = htmlCode.substring(bodyIndex, bodyEndIndex);
        //bodyCode = StringUtils.replace(htmlCode, "v:imagedata", "img");
        //bodyCode = StringUtils.replace(bodyCode, "</v:imagedata>", "");
        bodyCode = htmlCode;
    }
    return bodyCode;
}

A large part of the converted HTML is useless. We need to slim it down and replace some of the tags.

4. Handle the style tags in the HTML

Java code

/**
 * Extract the useful style block.
 *
 * @param htmlCode
 *            the complete HTML produced by Word
 * @return the style block whose content is wrapped in an HTML comment,
 *         or an empty string if no such block is found
 */
public static synchronized String performStyleCode(String htmlCode) {
    int index = 0;
    int styleStartIndex = -1;
    int styleEndIndex = -1;
    // Find the <style> tag that is followed directly by "<!--" on the next line:
    // 7 characters for "<style>" plus, presumably, a 2-character line break gives the offset 9.
    while (index < htmlCode.length()) {
        int styleIndexStartTemp = htmlCode.indexOf("<style>", index);
        if (styleIndexStartTemp == -1) {
            break;
        }
        int styleContentStartIndex = htmlCode.indexOf("<!--", styleIndexStartTemp);
        if (styleContentStartIndex - styleIndexStartTemp == 9) {
            styleStartIndex = styleIndexStartTemp;
            break;
        }
        index = styleIndexStartTemp + 7;
    }
    // Find the "-->" that is followed directly by "</style>" on the next line:
    // 3 characters for "-->" plus the same 2-character line break gives the offset 5.
    index = 0;
    while (index < htmlCode.length()) {
        int styleContentEndIndex = htmlCode.indexOf("-->", index);
        if (styleContentEndIndex == -1) {
            break;
        }
        int styleEndIndexTemp = htmlCode.indexOf("</style>", styleContentEndIndex);
        if (styleEndIndexTemp - styleContentEndIndex == 5) {
            styleEndIndex = styleEndIndexTemp;
            break;
        }
        index = styleContentEndIndex + 3;
    }
    // Without this check a missing block would return the first 8 characters of the page.
    if (styleStartIndex == -1 || styleEndIndex < styleStartIndex) {
        return "";
    }
    // 8 = length of "</style>"
    return htmlCode.substring(styleStartIndex, styleEndIndex + 8);
}

After Word is converted to HTML, the result contains many style tags. Only a style tag that wraps its content in an HTML comment, like

<style>
<!--
content omitted
-->
</style>

is useful; the rest are useless. The code above cuts out that useful part. If the file was read with its formatting intact in step 2, the code above will extract it correctly.
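Steps 3 and 4 depend on exact character offsets, which is why step 2 has to preserve the line breaks. As a possible alternative, here is a minimal sketch that does the same slicing with java.util.regex and tolerates any whitespace between the tags. The class WordHtmlSlicer and its method names are hypothetical and not part of the original WordUtils; unlike performBodyCode, extractBody stops at the closing body tag rather than at the closing html tag.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: slices the body and the commented style block out of Word's HTML. */
final class WordHtmlSlicer {
    // From the opening body tag to the closing body tag, case-insensitive, across lines.
    private static final Pattern BODY =
            Pattern.compile("<body\\b.*?</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // A style element whose content is wrapped in an HTML comment.
    private static final Pattern COMMENTED_STYLE =
            Pattern.compile("<style[^>]*>\\s*<!--.*?-->\\s*</style>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private WordHtmlSlicer() {
    }

    /** Reads the whole file with the given charset, keeping all whitespace. */
    static String readHtml(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    /** Returns the body element, or an empty string if there is none. */
    static String extractBody(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group() : "";
    }

    /** Returns the first style element whose content is an HTML comment, or an empty string. */
    static String extractCommentedStyle(String html) {
        Matcher m = COMMENTED_STYLE.matcher(html);
        return m.find() ? m.group() : "";
    }
}
```

Calling WordHtmlSlicer.extractCommentedStyle(html) + WordHtmlSlicer.extractBody(html) would then produce the same kind of string that the action assembles from styleCode and bodyCode.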
5. Handle the images in the Word file

Java code

import java.util.List;
import java.util.ResourceBundle;
import java.util.UUID;
import org.apache.commons.lang.StringUtils;
import org.htmlparser.Parser;
import org.htmlparser.util.ParserException;

/**
 * Handle the images in the body.
 *
 * @param bodyContent
 *            the body HTML
 * @return the body HTML with image links rewritten
 */
public static synchronized String performBodyImg(String bodyContent) {
    // Address of the action that previews an image by file name
    String newImgSrc = "tumbnail.action?fileName=";
    // Physical location of the Word files
    String filePath = ResourceBundle.getBundle("sysConfig").getString("userFilePath.word");
    // Physical location of the images
    String imgPath = ResourceBundle.getBundle("sysConfig").getString("userFilePath.image");
    Parser parser = Parser.createParser(bodyContent, "gb2312");
    ImgTagVisitor imgTag = new ImgTagVisitor();
    try {
        parser.visitAllNodesWith(imgTag);
        // All image addresses
        List<String> imgUrls = imgTag.getSrcStringList();
        for (String url : imgUrls) {
            String uuid = UUID.randomUUID().toString();
            String extName = url.substring(url.lastIndexOf("."));
            String newImgFileName = newImgSrc + uuid + extName;
            bodyContent = StringUtils.replace(bodyContent, url, newImgFileName);
            // ImageUtils is a helper class that is not shown in this post
            ImageUtils.copy(filePath + url, imgPath + uuid + extName);
        }
    } catch (ParserException e) {
        e.printStackTrace();
    }
    String result = bodyContent;
    // Remove leftover conditional-comment markers
    result = StringUtils.replace(result, "<![endif]>", "");
    result = StringUtils.replace(result, "<![if !vml]>", "");
    return result;
}

The code above uses the open-source HTML parsing tool htmlparser to find all image links, then uses StringUtils from the Apache Commons Lang package to replace each link with the image-preview action in FCK that I modified. Below is my own ImgTagVisitor implementation.

Java code

package com.bettem.cms.web.utils.htmlparser;

import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Tag;
import org.htmlparser.Text;
import org.htmlparser.visitors.NodeVisitor;

/**
 * Description: class used by htmlparser to process img tags
 * *******************
 * Date       Author
 * 2010-2-3   Liqiang
 */
public class ImgTagVisitor extends NodeVisitor {
    private List<String> srcList;
    private StringBuffer textAccumulator;

    public ImgTagVisitor() {
        srcList = new ArrayList<String>();
        textAccumulator = new StringBuffer();
    }

    public void visitTag(Tag tag) {
        // Collect the src of every img tag; skip tags that have no src attribute
        if (tag.getTagName().equalsIgnoreCase("img") && tag.getAttribute("src") != null) {
            srcList.add(tag.getAttribute("src"));
        }
    }

    public List<String> getSrcStringList() {
        return srcList;
    }

    public void visitStringNode(Text stringNode) {
        String text = stringNode.getText();
        textAccumulator.append(text);
    }

    public String getText() {
        return textAccumulator.toString();
    }
}

6. Remove the redundant v:imagedata tags

Java code

import org.htmlparser.Parser;
import org.htmlparser.filters.NotFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * Remove the redundant v:imagedata tags.
 *
 * @param content
 *            the HTML to filter
 * @return the filtered HTML, or the input unchanged if parsing failed
 */
public static synchronized String removeImagedataTag(String content) {
    try {
        Parser parser = new Parser(content, Parser.STDOUT);
        // Keep every node that is not a v:imagedata tag.
        // (The original wrapped the same NotFilter twice in an AndFilter, which is equivalent.)
        NotFilter notImagedata = new NotFilter(new TagNameFilter("v:imagedata"));
        NodeList nl = parser.extractAllNodesThatMatch(notImagedata);
        return nl.toHtml();
    } catch (ParserException e) {
        e.printStackTrace();
        return content;
    }
}

When Word converts to HTML, large images are automatically compressed into smaller ones, but the original large images are still referenced in the code. The code above filters out those redundant tags.

Finally, here is the code in my action.

Java code

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ResourceBundle;
import java.util.UUID;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.struts2.ServletActionContext;

/**
 * Import a Word file.
 *
 * @return the result name
 */
public synchronized String exportWord() {
    String content = null;
    String path = ResourceBundle.getBundle("sysConfig").getString("userFilePath.word");
    InputStream ins = null;
    OutputStream wordFile = null;
    String htmlPath = null;
    String wordPath = null;
    // Handle the uploaded Word file
    try {
        String uuid = UUID.randomUUID().toString();
        // Keep the original extension
        String fileName = uuid + filedataFileName.substring(filedataFileName.lastIndexOf("."));
        // Generate the HTML file name
        String wordHtmlFileName = uuid + ".html";
        ins = new FileInputStream(filedata);
        wordPath = path + fileName;
        wordFile = new FileOutputStream(wordPath);
        IOUtils.copy(ins, wordFile);
        // Word to HTML
        htmlPath = path + wordHtmlFileName;
        WordUtils.word2Html(wordPath, htmlPath);
        String wordHtmlContent = WordUtils.getHtmlCode(htmlPath);
        // Handle the styles
        String styleCode = WordUtils.performStyleCode(wordHtmlContent);
        String bodyCode = WordUtils.performBodyCode(wordHtmlContent);
        // Handle the images in the article
        bodyCode = WordUtils.performBodyImg(bodyCode);
        content = styleCode + bodyCode;
        // Note: the return value is not assigned, so content is passed on as it is
        WordUtils.removeImagedataTag(content);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeQuietly(wordFile);
        IOUtils.closeQuietly(ins);
        try {
            // The paths stay null if an exception occurred before they were assigned
            if (wordPath != null && htmlPath != null) {
                File word = new File(wordPath);
                File file = new File(htmlPath);
                if (file.exists()) {
                    file.delete();
                    word.delete();
                    // Word writes the images into a sibling "<name>.files" directory
                    FileUtils.deleteDirectory(new File(htmlPath.substring(0, htmlPath.lastIndexOf(".")) + ".files"));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    // Put the Word content into the request
    ServletActionContext.getRequest().setAttribute("content", content);
    ServletActionContext.getRequest().setAttribute("add", true);
    return SUCCESS;
}
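For comparison, steps 5 and 6 can also be approximated with the standard library alone. The sketch below is not the htmlparser-based approach used above. It assumes that Word quotes the src attribute of its img tags, and the class WordImageMarkup with its methods is a hypothetical name made up for this example. It returns the rewritten HTML and records which original path maps to which new file name, so the caller can copy the files afterwards.

```java
import java.util.Map;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: rewrites img links and strips v:imagedata tags using regular expressions. */
final class WordImageMarkup {
    // Group 1: everything up to "src=", group 2: the quote character, group 3: the old path.
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img\\b[^>]*?\\bsrc\\s*=\\s*)([\"'])(.*?)\\2",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Opening, closing, and self-closing v:imagedata tags.
    private static final Pattern IMAGEDATA = Pattern.compile(
            "</?v:imagedata\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    private WordImageMarkup() {
    }

    /** Removes every v:imagedata tag and leaves the rest of the markup untouched. */
    static String stripImagedata(String html) {
        return IMAGEDATA.matcher(html).replaceAll("");
    }

    /**
     * Points every img src at urlPrefix + a random file name that keeps the old extension.
     * The map receives one entry per distinct old path: old path -> new file name.
     */
    static String rewriteImgSrc(String html, String urlPrefix, Map<String, String> renamed) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String oldSrc = m.group(3);
            String newName = renamed.get(oldSrc);
            if (newName == null) {
                int dot = oldSrc.lastIndexOf('.');
                String ext = dot >= 0 ? oldSrc.substring(dot) : "";
                newName = UUID.randomUUID() + ext;
                renamed.put(oldSrc, newName);
            }
            String replacement = m.group(1) + m.group(2) + urlPrefix + newName + m.group(2);
            // quoteReplacement keeps any '$' or '\' in the text from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

In the action, the result of stripImagedata could be assigned back to content, and each map entry could drive one ImageUtils.copy call. Because only the src attribute inside an img tag is touched, an identical path elsewhere in the text is left alone, which a plain string replace cannot guarantee.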