How do I read specific rows using the Apache POI Event API?
I want to read large xls or xlsx files (more than 30 MB, with 70,000+ rows). I could read small Excel files easily using Apache POI, but with large files I run into an OutOfMemory error.
Performance and memory usage are a concern for me. I read in many posts that if memory footprint is an issue, then for XSSF you can get at the underlying XML data and process it yourself using XSSF and SAX (Event API). I found this interesting and can now read an entire xlsx file without any issue. With the Event API it consumed much less memory (less than 70 MB); without it, usage climbed to almost 1 GB (with -Xmx set to 1024m) and the program still used to hang.
But now I want to customize the read process and read only specific rows from an Excel file. I could easily do this using org.apache.poi.ss.usermodel.Sheet#getRow(int rownum). With the Event API, however, all rows are read without any interruption, and I find it difficult to read only specific rows, e.g. just row numbers 2, 3, 5, etc. Below is my entire code:
import java.io.InputStream;
import java.util.Iterator;
import java.util.Vector;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

/**
 * XSSF and SAX (Event API)
 */
public class FromHowTo {

    public void processAllSheets(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename);
        XSSFReader r = new XSSFReader(pkg);
        SharedStringsTable sst = r.getSharedStringsTable();
        XMLReader parser = fetchSheetParser(sst);
        Iterator<InputStream> sheets = r.getSheetsData();
        while (sheets.hasNext()) {
            InputStream sheet = sheets.next();
            InputSource sheetSource = new InputSource(sheet);
            parser.parse(sheetSource);
            sheet.close();
        }
    }

    public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {
        XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
        ContentHandler handler = new SheetHandler(sst);
        parser.setContentHandler(handler);
        return parser;
    }

    /**
     * See org.xml.sax.helpers.DefaultHandler javadocs
     */
    private static class SheetHandler extends DefaultHandler {
        private SharedStringsTable sst;
        private String lastContents;
        private boolean nextIsString;
        // Values of the cells in the row currently being read
        private Vector<String> values = new Vector<String>(10);

        private SheetHandler(SharedStringsTable sst) {
            this.sst = sst;
        }

        @Override
        public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {
            // c => cell
            if (name.equals("c")) {
                // Figure out if the value is an index in the SST
                String cellType = attributes.getValue("t");
                nextIsString = cellType != null && cellType.equals("s");
            }
            // Clear contents cache
            lastContents = "";
        }

        @Override
        public void endElement(String uri, String localName, String name) throws SAXException {
            // v => contents of a cell
            // Process here, as characters() may be called more than once
            if (name.equals("v")) {
                if (nextIsString) {
                    // The value is an index into the shared strings table
                    int idx = Integer.parseInt(lastContents);
                    lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
                    // Reset so the lookup is not repeated for the closing c and row elements
                    nextIsString = false;
                }
                values.add(lastContents);
            }
            // End of a row: output the collected values and start over
            if (name.equals("row")) {
                System.out.println(values);
                values.removeAllElements();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            lastContents += new String(ch, start, length);
        }
    }

    public static void main(String[] args) throws Exception {
        FromHowTo howto = new FromHowTo();
        howto.processAllSheets(args[0]);
    }
}
I am using JRE 7 with Apache POI 3.7. Can someone please help me get specific rows with the Event API?
Each row start element has a row number. It can be retrieved from the attributes:
long rowIndex = Long.parseLong(attributes.getValue("r"));
The event model will still go through all rows, but you can capture the index when the row element starts and then handle your data accordingly in endElement.
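To make the idea concrete, here is a minimal sketch of the filtering. It is not POI code: it uses only the JDK's SAX parser on a tiny hand-written sheet XML, so that it runs without any jars. The names RowFilterSketch, RowFilterHandler, readRows and SAMPLE_XML are made up for this illustration, and the shared-strings lookup from the question's handler is left out (the sample cells hold plain numbers). It also assumes every row element carries the "r" attribute, which is what Excel normally writes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class RowFilterSketch {

    // Hypothetical stand-in for the sheet XML that XSSFReader would stream
    static final String SAMPLE_XML =
          "<worksheet><sheetData>"
        + "<row r=\"1\"><c r=\"A1\"><v>10</v></c><c r=\"B1\"><v>11</v></c></row>"
        + "<row r=\"2\"><c r=\"A2\"><v>20</v></c><c r=\"B2\"><v>21</v></c></row>"
        + "<row r=\"3\"><c r=\"A3\"><v>30</v></c></row>"
        + "<row r=\"4\"><c r=\"A4\"><v>40</v></c></row>"
        + "<row r=\"5\"><c r=\"A5\"><v>50</v></c></row>"
        + "</sheetData></worksheet>";

    /** Keeps the v values of rows whose "r" attribute is in wantedRows. */
    static class RowFilterHandler extends DefaultHandler {
        private final Set<Long> wantedRows;
        private final Map<Long, List<String>> result = new LinkedHashMap<>();
        private final StringBuilder text = new StringBuilder();
        private List<String> currentValues; // null while inside an unwanted row
        private long currentRow;

        RowFilterHandler(Set<Long> wantedRows) {
            this.wantedRows = wantedRows;
        }

        @Override
        public void startElement(String uri, String localName, String name, Attributes attributes) {
            if (name.equals("row")) {
                // The row number is available as soon as the row element opens
                currentRow = Long.parseLong(attributes.getValue("r"));
                currentValues = wantedRows.contains(currentRow) ? new ArrayList<>() : null;
            }
            text.setLength(0); // clear the contents cache
        }

        @Override
        public void endElement(String uri, String localName, String name) {
            if (currentValues == null) {
                return; // unwanted row: ignore its cells
            }
            if (name.equals("v")) {
                currentValues.add(text.toString());
            } else if (name.equals("row")) {
                result.put(currentRow, currentValues);
                currentValues = null;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (currentValues != null) {
                text.append(ch, start, length); // may be called more than once per value
            }
        }

        Map<Long, List<String>> getResult() {
            return result;
        }
    }

    static Map<Long, List<String>> readRows(String sheetXml, Set<Long> wantedRows) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        RowFilterHandler handler = new RowFilterHandler(wantedRows);
        parser.parse(new ByteArrayInputStream(sheetXml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getResult();
    }

    public static void main(String[] args) throws Exception {
        Set<Long> wanted = new HashSet<>(Arrays.asList(2L, 3L, 5L));
        System.out.println(readRows(SAMPLE_XML, wanted)); // prints {2=[20, 21], 3=[30], 5=[50]}
    }
}
```

The same check should carry over to the SheetHandler in the question: remember the row number in startElement when name is "row", and only collect or print values in endElement if that number is one you want. The parser still streams through every row, so this saves the work of handling unwanted cells, not the work of parsing them.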