本文分类:
Java

本文介绍使用Apache POI的EventModel模式导入大数据量Excel的方法,包含XLS和XLSX格式。EventModel方式基于事件驱动,CPU和内存消耗都非常低,但是只能读不能写。

本文示例代码在https://github.com/hiwzc/java-demo

XLSX 格式

xlsx实际是一种XML格式。示例代码如下:

public class XlsxImporter {

    public static void main(String[] args) throws Exception {
        InputStream is = XlsxImporter.class.getResourceAsStream("/test.xlsx");
        Map<String, List<List<String>>> sheetDatas = XlsxImporter.doImport(is, false);
        PrintUtil.printSheetDatas(sheetDatas);
        System.out.println("Visit hiwzc.com for more information.");
    }

    public static Map<String, List<List<String>>> doImport(InputStream is, boolean formulasNotResults) throws Exception {
        Map<String, List<List<String>>> result = new LinkedHashMap<>();

        try (OPCPackage p = OPCPackage.open(is)) {
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(p);

            XSSFReader xssfReader = new XSSFReader(p);
            StylesTable styles = xssfReader.getStylesTable();

            XSSFReader.SheetIterator it = (XSSFReader.SheetIterator) xssfReader.getSheetsData();

            DataFormatter formatter = new DataFormatter();

            while (it.hasNext()) {
                try (InputStream stream = it.next()) {
                    String sheetName = it.getSheetName();
                    List<List<String>> sheetData = new ArrayList<>();
                    result.put(sheetName, sheetData);
                    XMLReader sheetParser = SAXHelper.newXMLReader();
                    sheetParser.setContentHandler(new XSSFSheetXMLHandler(styles, null, strings,
                            new DemoSheetContentsHandler(sheetData), formatter, formulasNotResults));
                    sheetParser.parse(new InputSource(stream));
                }
            }
        }

        return result;
    }

    private static class DemoSheetContentsHandler implements SheetContentsHandler {
        private List<List<String>> sheetData;
        private List<String> line = null;
        private int currRow = -1;
        private int currCol = -1;

        public DemoSheetContentsHandler(List<List<String>> sheetData) {
            this.sheetData = sheetData;
        }

        @Override
        public void startRow(int rowNum) {
            completeMissingRows(rowNum - currRow - 1);
            line = new ArrayList<>();
            currRow = rowNum;
            currCol = -1;
        }

        // 补全空行
        private void completeMissingRows(int number) {
            for (int i = 0; i < number; i++) {
                sheetData.add(Collections.emptyList());
            }
        }

        @Override
        public void endRow(int rowNum) {
            sheetData.add(line);
        }

        @Override
        public void cell(String cellReference, String formattedValue, XSSFComment comment) {
            if (cellReference != null) {
                int thisCol = (new CellReference(cellReference)).getCol();
                completeMissingCols(thisCol - currCol - 1);
                currCol = thisCol;
                line.add(formattedValue);
            }
        }

        // 补全空列
        private void completeMissingCols(int number) {
            for (int i = 0; i < number; i++) {
                line.add(null);
            }
        }

        @Override
        public void headerFooter(String text, boolean isHeader, String tagName) {
            // do nothing
        }
    }
}

XLS 格式

XLS格式是一种二进制文件,整个文件由一个个的Record组成,关于该格式的详细内容,可以参考文末列出的资料。

public class XlsImporter implements HSSFListener {
    // 输出公式还是计算的结果
    private boolean formulasNotResults;

    // 用于解析公式
    private SheetRecordCollectingListener workbookBuildingListener;
    private HSSFWorkbook stubWorkbook;

    // 处理公式的String结果
    private int nextRow;
    private int nextColumn;
    private boolean outputNextStringRecord;

    // 记录 SSR,用于解析 LabelSSTRecord
    private SSTRecord sstRecord;
    private FormatTrackingHSSFListener formatListener;

    // 记录当前处理的 Sheet 及其名称
    private int sheetIndex = -1;
    private List<String> sheetName = new ArrayList<>();

    // 记录结果
    private Map<String, List<List<String>>> result = new LinkedHashMap<>();
    // 当前处理的 Sheet 数据
    private List<List<String>> sheetData;

    public static void main(String[] args) throws Exception {
        InputStream is = XlsImporter.class.getResourceAsStream("/test.xls");
        Map<String, List<List<String>>> sheetDatas = new XlsImporter().doImport(is, false);
        PrintUtil.printSheetDatas(sheetDatas);
        System.out.println("Visit hiwzc.com for more information.");
    }

    public Map<String, List<List<String>>> doImport(InputStream is, boolean formulasNotResults) throws IOException {
        this.formulasNotResults = formulasNotResults;

        MissingRecordAwareHSSFListener listener = new MissingRecordAwareHSSFListener(this);
        formatListener = new FormatTrackingHSSFListener(listener);

        HSSFEventFactory factory = new HSSFEventFactory();
        HSSFRequest request = new HSSFRequest();

        if (this.formulasNotResults) {
            workbookBuildingListener = new SheetRecordCollectingListener(formatListener);
            request.addListenerForAllRecords(workbookBuildingListener);
        } else {
            request.addListenerForAllRecords(formatListener);
        }

        factory.processWorkbookEvents(request, new POIFSFileSystem(is));
        return result;
    }

    @Override
    public void processRecord(Record record) {
        int currRow = -1;
        int currCol = -1;
        String value = null;

        switch (record.getSid()) {
            case BOFRecord.sid:
                // 记录了一个 Workbook 或一个 sheet 的开始
                BOFRecord br = (BOFRecord) record;
                if (br.getType() == BOFRecord.TYPE_WORKSHEET) {
                    sheetIndex++;

                    if (workbookBuildingListener != null && stubWorkbook == null) {
                        stubWorkbook = workbookBuildingListener.getStubHSSFWorkbook();
                    }

                    sheetData = new ArrayList<>();
                    result.put(sheetName.get(sheetIndex), sheetData);
                }
                break;

            case BoundSheetRecord.sid:
                // 保存Sheet的名称
                BoundSheetRecord bsrec = (BoundSheetRecord) record;
                sheetName.add(bsrec.getSheetname());
                break;

            case DimensionsRecord.sid:
                // 记录了每个Sheet的有效起始结束行列索引
                break;

            case RowRecord.sid:
                // 记录了Sheet中的行信息,如行索引、是否隐藏
                RowRecord rrec = (RowRecord) record;
                while (sheetData.size() < rrec.getRowNumber()) {
                    sheetData.add(null);
                }
                sheetData.add(new ArrayList<>(rrec.getLastCol()));
                break;

            case SSTRecord.sid:
                // 存储了在Excel中文本单元格中的文本值,文本单元格通过索引获取文本值
                // SST是Shared String Table的缩写
                sstRecord = (SSTRecord) record;
                break;

            case BlankRecord.sid:
                // 空白单元格,只有样式,没有值
                BlankRecord brec = (BlankRecord) record;
                currRow = brec.getRow();
                currCol = brec.getColumn();
                break;

            case BoolErrRecord.sid:
                // Boolean 或 Error 单元格
                BoolErrRecord berec = (BoolErrRecord) record;
                currRow = berec.getRow();
                currCol = berec.getColumn();
                if (berec.isBoolean()) {
                    value = BoolEval.valueOf(berec.getBooleanValue()).getStringValue();
                } else {
                    value = ErrorEval.getText(berec.getErrorValue());
                }
                break;

            case FormulaRecord.sid:
                // 公式单元格,数字公式的值保存在单元格的中,文本公式的结果保存在一个StringRecord中
                FormulaRecord frec = (FormulaRecord) record;
                currRow = frec.getRow();
                currCol = frec.getColumn();
                if (formulasNotResults) {
                    value = HSSFFormulaParser.toFormulaString(stubWorkbook, frec.getParsedExpression());
                } else {
                    if (frec.hasCachedResultString()) {
                        outputNextStringRecord = true;
                        nextRow = frec.getRow();
                        nextColumn = frec.getColumn();
                    } else if (frec.getCachedResultType() == CellType.ERROR.getCode()) {
                        value = ErrorEval.getText(frec.getCachedErrorValue());
                    } else if (frec.getCachedResultType() == CellType.BOOLEAN.getCode()) {
                        value = BoolEval.valueOf(frec.getCachedBooleanValue()).getStringValue();
                    } else {
                        value = formatListener.formatNumberDateCell(frec);
                    }
                }
                break;

            case StringRecord.sid:
                // 保存文本公式的结果
                if (outputNextStringRecord) {
                    StringRecord srec = (StringRecord) record;
                    value = srec.getString();
                    currRow = nextRow;
                    currCol = nextColumn;
                    outputNextStringRecord = false;
                }
                break;

            case LabelRecord.sid:
                // 文本单元格,文本值直接存储在单元格中
                LabelRecord lrec = (LabelRecord) record;
                currRow = lrec.getRow();
                currCol = lrec.getColumn();
                value = lrec.getValue();
                break;

            case LabelSSTRecord.sid:
                // 文本单元格,文本值引用保存在 SSTRecord 中的值
                LabelSSTRecord lsrec = (LabelSSTRecord) record;
                currRow = lsrec.getRow();
                currCol = lsrec.getColumn();
                if (sstRecord == null) {
                    value = "[SST ERR]";
                } else {
                    value = sstRecord.getString(lsrec.getSSTIndex()).toString();
                }
                break;

            case NumberRecord.sid:
                // 数字单元格
                NumberRecord numrec = (NumberRecord) record;
                currRow = numrec.getRow();
                currCol = numrec.getColumn();
                value = formatListener.formatNumberDateCell(numrec);
                break;

            default:
                break;
        }

        if (value != null) {
            List<String> line = sheetData.get(currRow);
            while (line.size() < currCol) {
                line.add(null);
            }
            line.add(value);
        }
    }
}

参考资料

本文来自 [时光记 - 王智超的个人空间](www.hiwzc.com),转载请注明出处。