IT_Programming/Dev Libs & Framework

[Apache POI using Java] MS word, excel, powerpoint 데이터 핸들링

JJun ™ 2011. 6. 20. 13:04

-----------------------------------------------------------------------

출처: http://rollin96.springnote.com/pages/5597379
       http://buedt.tistory.com/58

-----------------------------------------------------------------------




JAKARTA_POI.doc

 


Apache POI를 이용하여 Excel파일(.xls, .xlsx) 처리하기

 


Table of contents

작성자 정보
  • 작성일 : 2009-11-29
  • 작성자 : 박문석 (커뮤니케이션개발팀)
  • 문서정보 :
    User Edits Comments Watches
    박문석 2 0 0

Abstraction

Apache POI 프로젝트의 개요와 Excel파일(.xls, .xlsx) 처리하는 방법을 설명함.



내용
POI(Poor Obfuscation Implementation)

POI는 MicroSoft 사의 OLE2 기반의 파일 형식을 자바로 개발하기 위한 Apache Software Foundation의 프로젝트 중 하나이다. POI는 MS Format File을 자바를 이용하여 액세스 할 수 있는 API를 제공한다.


POI의 서브 프로젝트
  • POIFS : OLE2 Compound Document Formats 를 읽기/쓰기 위한 API (기본적으로 POI의 모든 컴포넌트들이 POIFS를 사용)
  • HSSF, XSSF : ExcelExcel 스프레드시트들을 읽기, 쓰기 위한 API (HSSF는 97-2007까지의 Excel 포멧을 지원 XSSF는 .xlsx 확장자를 지원하는 OOXML을 지원)
  • HWPF : WordWord 문서들을 읽기, 쓰기 위한 API
  • HSLF : PowerPointPowerPoint 문서들을 읽기, 쓰기 위한 API
  • HPSF : OLE2OLE2 포맷 형식의 문서 property sets(속성 세트들)를 읽기 위한 API


POI를 이용한 Excel문서(.xls, .xlsx) 처리하기
  • Apache POI 프로젝트 사이트에 방문하여 최신 버전을 다운로드

  • 필요한 라이브러리 목록
    • POI 라이브러리
      • poi-3.5-FINAL-20090928.jar
      • poi-contrib-3.5-FINAL-20090928.jar
      • poi-ooxml-3.5-FINAL-20090928.jar
      • poi-scratchpad-3.5-FINAL-20090928.jar
    • ooxml 라이브러리
      • dom4j-1.6.1.jar : DOM 구조를 Java 객체로 매핑 혹은 변환
      • geronimo-stax-api_1.0_spec-1.0.jar : 이건 JDK 1.6버전부터 있는 javax.xml.stream 패키지를 JDK 1.5에서 사용할 수 있도록하는 라이브러리
      • ooxml-schemas-1.0.jar : Office Open XML의 스키마
      • xmlbeans-2.3.0.jar : Java Type으로 XML에 접근할 수 있도록하는 라이브러리

  • 사용 예제
    package net.study.poi.hssf;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.poi.hssf.extractor.ExcelExtractor;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;
    import org.apache.poi.ss.usermodel.Cell;
    import org.apache.poi.ss.usermodel.DateUtil;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.util.CellReference;
    import org.apache.poi.xssf.extractor.XSSFExcelExtractor;
    import org.apache.poi.xssf.usermodel.XSSFWorkbook;
    public class ExcelReader {
      public void read(String excel) throws IOException {
                      // check file
        File file = new File(excel);
        if (!file.exists() || !file.isFile() || !file.canRead()) {
             throw new IOException(excel);
        }
        // Workbook
        HSSFWorkbook wb = new HSSFWorkbook(new FileInputStream(file));
       // Text Extraction
       ExcelExtractor extractor = new ExcelExtractor(wb);
       extractor.setFormulasNotResults(true);
       extractor.setIncludeSheetNames(false);
       System.out.println( extractor.getText() );
       // Getting cell contents
       for( int i=0; i<wb.getNumberOfSheets(); i++) {
         for( Row row : wb.getSheetAt(i) ) {
           for( Cell cell : row ) {
              // Cell Reference
              CellReference cellRef = new CellReference(row.getRowNum(), cell.getColumnIndex());
              System.out.print(cellRef.formatAsString());
              System.out.print(" - ");
              switch( cell.getCellType() ) {
                 case Cell.CELL_TYPE_STRING :
                     System.out.println( cell.getRichStringCellValue().getString() );
                 break;
                 case Cell.CELL_TYPE_NUMERIC :
                    if(DateUtil.isCellDateFormatted(cell)) {
                      System.out.println(cell.getDateCellValue());
                    } else {
                      System.out.println(cell.getNumericCellValue());
                    }
                 break;
                 case Cell.CELL_TYPE_BOOLEAN :
                      System.out.println(cell.getBooleanCellValue());
                 break;
                 case Cell.CELL_TYPE_FORMULA :
                      System.out.println(cell.getCellFormula());
                 break;
                 default:
                      System.out.println();
             }
           }
          }
        }
      }
      public void read2007(String excel) throws IOException {
            // check file
        File file = new File(excel);
        if (!file.exists() || !file.isFile() || !file.canRead()) {
            throw new IOException(excel);
        }
        // Workbook
        XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(file));
        // Text Extraction
        XSSFExcelExtractor extractor = new XSSFExcelExtractor(wb);
        extractor.setFormulasNotResults(true);
        extractor.setIncludeSheetNames(false);
        System.out.println( extractor.getText() );
        // Getting cell contents
        for( int i=0; i<wb.getNumberOfSheets(); i++) {
           for( Row row : wb.getSheetAt(i) ) {
              for( Cell cell : row ) {
                   // Cell Reference
                   CellReference cellRef = new CellReference(row.getRowNum(), cell.getColumnIndex());
                   System.out.print(cellRef.formatAsString());
                   System.out.print(" - ");
                   switch( cell.getCellType() ) {
                       case Cell.CELL_TYPE_STRING :
                           System.out.println( cell.getRichStringCellValue().getString() );
                       break;
                       case Cell.CELL_TYPE_NUMERIC :
                          if(DateUtil.isCellDateFormatted(cell)) {
                             System.out.println(cell.getDateCellValue());
                          } else {
                             System.out.println(cell.getNumericCellValue());
                          }
                          break;
                      case Cell.CELL_TYPE_BOOLEAN :
                             System.out.println(cell.getBooleanCellValue());
                      break;
                      case Cell.CELL_TYPE_FORMULA :
                             System.out.println(cell.getCellFormula());
                      break;
                      default:
                             System.out.println();
                   }
                 }
              }
           }
        }
        
        public static void main(String argv[]) {
            try {
               // ExcelReader
               ExcelReader reader = new ExcelReader();
               // Test File
               String excel = reader.getClass().getResource("/point.xls").getPath();
               String excel2007 = reader.getClass().getResource("/point.xlsx").getPath();
              // reader.read(excel);
              reader.read2007(excel2007);
           } catch (Exception e) {
                System.out.println("Exception occurred! : " + e);
           }
       }
    }
                 

References



 

 




1. 엑셀 읽어 DB에 저장.

Action 부분 : 업로드 파일을 받아서 서버에 저장.

// 파일 이름 중복 방지
SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMddHHmmss");
String rndName = sdf.format(new java.util.Date()) + System.currentTimeMillis();
// 실제 저장될 파일 이름
String realName = rndName + ".xls";
// 실제로 저장될 파일 풀 경로
File file = new File( "경로" +"/", realName);
// 저장하기(복사)
is = vo.getUploadFile().getInputStream();
os = new FileOutputStream(file);
int b = 0; 
int buf_length = 4096;
byte[] buff = new byte[buf_length];
while ((b = is.read(buff, 0, buf_length)) != -1) {
	os.write(buff, 0, b);
	os.flush();
}
vo.setUpload_file(realName);


UTIL(DAO)부분 : 엑셀 파일을 DB에 저장
// 엑셀파일 로드
File file = new File("경로" +"/", vo.getUpload_file());
POIFSFileSystem fs=new POIFSFileSystem(new FileInputStream(file));
HSSFWorkbook wb=new HSSFWorkbook(fs);
HSSFSheet sheet=wb.getSheetAt(0); // 시트 가져오기
int rows=sheet.getPhysicalNumberOfRows(); // Row 갯수 가져오기
int cells=sheet.getRow(0).getPhysicalNumberOfCells(); // cell 갯수 가져오기
if ( rows >= 2 ) { // 1번째 줄일 제목 줄이기 때문에 2번부터 시작
    for(int i=1; i<rows; i++){ // row 루프
        HSSFRow row=sheet.getRow(i); // row 가져오기
        if(row != null && !"".equals(row.getCell(0).getStringCellValue())){
            int idx=1;
            for(int c=0; c<cells;   c++){ // cell 루프
                HSSFCell cell = row.getCell(c); // cell 가져오기
                String value = "";
                if(cell!=null){
                    switch(cell.getCellType()){ // cell 타입에 따른 데이타 저장
                        case HSSFCell.CELL_TYPE_FORMULA:
                            value=cell.getCellFormula();
                            break;
                        case HSSFCell.CELL_TYPE_NUMERIC:
                            value=""+cell.getNumericCellValue();
                            break;
                        case HSSFCell.CELL_TYPE_STRING:
                            value=""+cell.getStringCellValue();
                            break;
                        case HSSFCell.CELL_TYPE_BLANK:
                            //value=""+cell.getBooleanCellValue();
                            value="";
                            break;
                        case HSSFCell.CELL_TYPE_ERROR:
                            value=""+cell.getErrorCellValue();
                            break;
                        default:
                    }
                }
                stmt.setString(idx++, value.trim());
                
            }
            result = stmt.executeUpdate(); //쿼리문 insert 실행
        }
    }
}
logger.info("완료!!");



간단하기때문에 jakarta 홈페이지에 가서 guide를 참조하면서 개발 하면 된다.



2. DB에서 자료 읽어서 서버에 엑셀 저장 후 다운로드

// Excel Write
HSSFWorkbook workbook = new HSSFWorkbook();
HSSFSheet sheet = workbook.createSheet("Sheet1");

// Font 설정.
HSSFFont font = workbook.createFont();
font.setFontName(HSSFFont.FONT_ARIAL);
//제목의 스타일 지정
HSSFCellStyle titlestyle = workbook.createCellStyle();
titlestyle.setFillForegroundColor(HSSFColor.SKY_BLUE.index);
titlestyle.setFillPattern(HSSFCellStyle.SOLID_FOREGROUND);
titlestyle.setAlignment(HSSFCellStyle.ALIGN_CENTER);
titlestyle.setFont(font);
// Row 생성
HSSFRow row = sheet.createRow((short)0);
// Cell 생성
HSSFCell cell1 = row.createCell((short)0 );
cell1.setCellValue("제목1");
cell1.setCellStyle(titlestyle);
HSSFCell cell2 = row.createCell((short)1);
cell2.setCellValue("제목2");
cell2.setCellStyle(titlestyle);
HSSFCell cell3 = row.createCell((short)2);
cell3.setCellValue("제목3");
cell3.setCellStyle(titlestyle);
// 내용 스타일 지정
HSSFCellStyle style = workbook.createCellStyle();
style.setFont(font);
// 내용 중 가운데 정렬 추가
HSSFCellStyle styleCenter = workbook.createCellStyle();
styleCenter.setAlignment(HSSFCellStyle.ALIGN_CENTER);
styleCenter.setFont(font);
for (int i=0; i<list.size();i++){
    row = sheet.createRow((short)(i+1));
   
    cell1 = row.createCell((short)0 );
    cell1.setCellValue("내용1");
    cell1.setCellStyle(styleCenter);
    
    cell2 = row.createCell((short)1);
    cell2.setCellValue("내용2");
    cell2.setCellStyle(style);
    
    cell3 = row.createCell((short)2);
    cell3.setCellValue("내용3");
    cell3.setCellStyle(style);
    
    
}
// 실제 저장될 파일 이름
realName = "TEST.xls";
// 실제로 저장될 파일 풀 경로
File file = new File("경로" +"/", realName);
// 엑셀 파일을 만듬
FileOutputStream fileOutput = new FileOutputStream(file);
workbook.write(fileOutput);
fileOutput.close();
logger.info("Excel File 생성 OK");


참조 : 
http://poi.apache.org/spreadsheet/quick-guide.html

 


 


 

 

 

POI - HSSF

  • java에서 MS엑셀 파일을 쓰기가 필요하여 사용한 POI-HSSF를 간단히 소개한다
  • 사용자에게 제공되는 서비스에 이용하기 보다는 관리자에게 서비스에 관한 데이타를 제공할때 사용되면 좋을듯 싶다

예제

  • 아래 코드는 DB에서 400개 단위로 읽어온 데이타를 기록하는 코드이다
  • workbook, sheet, row, cell 등이 object로 제공된다
          // workbook
          HSSFWorkbook wb = new HSSFWorkbook();
          // sheet
          HSSFSheet sheet = wb.createSheet("sheetname");
          // title cell
          sheet.createRow((short) 0).createCell((short) 0).setCellValue("Position");
          sheet.createRow((short) 0).createCell((short) 1).setCellValue("DaumID");
          sheet.createRow((short) 0).createCell((short) 2).setCellValue("[DaumName]");
          sheet.createRow((short) 0).createCell((short) 3).setCellValue("imgUrl");
          int startpoint = currentpage * 400 + 1;
          int index = 0;
          for (int i = startpoint; i < testList.size() + startpoint; i++) {
                  sheet.createRow((short) i).createCell((short) 0).setCellValue(testList.get(index).getPosition());
                  sheet.createRow((short) i).createCell((short) 1).setCellValue(testList.get(index).getDaumid());
                  // 한글 인코딩
           HSSFCell nametmp = sheet.createRow((short) i).createCell((short) 2);
                  nametmp.setEncoding(HSSFCell.ENCODING_UTF_16);
                  nametmp.setCellValue(testList.get(index).getDaumname());                        
                  sheet.createRow((short) i).createCell((short) 3).setCellValue(tesetList.get(index).getUrl());                 
                  index++;
          }
          [FileOutputStream] fileOut = new [FileOutputStream]("/hanmail/test.xls");
          wb.write(fileOut);
          fileOut.close();
          
  • 2번째 cell은 한글 인코딩이 필요하여 cell을 만들어 놓은 후 setEncoding을 한것을 볼 수 있다
  • 기존 파일 write와 같이 workbook을 write 하면 된다

etc

  • jakarta POI
      POIFS : Microsoft의 OLE2 포맷 형식의 문서를 자바로 읽고 쓸 수 있는 컴포넌트
      HSSF  : Microsoft의 엑셀파일을 읽고 쓸수 있도록 지원하는 컴포넌트
      HWPF  : Microsoft의 워드파일을 읽고 쓸수 있도록 지원하는 컴포넌트
      HPSF  : Microsoft의 OLE2 포맷 형식의 문서 속성을 어플리케이션에서 사용 할 수                있도록 지원하는 컴포넌트
  • cell의 위치만 기억해 놓는다면 append도 가능하다
  • 기존의 파일 쓰기보다는 많이 느린편이다
  • 같은 cell의 위치에 쓰게되면 중복이되어 나중에 남게된다
  • cell의 병합이나 색깔넣기등 다양한 편집기능도 가능하고 POI 3.0 에서 부터는 이미지넣기 까지 가능하다
  • 최근 POI 프로젝트가 Ruby에도 바인딩 되었다
  • download
     http://jakarta.apache.org/site/downloads/downloads_poi.cgi      

 

 

 



 



 

Microsoft Word Documents

                                                                                                     Last updated Jun 15, 2007.

While Apache POI's support for Microsoft Word documents is simplistic, it does provide the capabilities to read the contents of a Word document. In this section I review how to use POI to extract the contents of a Word document.

Word documents are maintained by a POIFSFileSystem class, just like the other documents in this series. Thus to access a Word document, you start with a POIFSFileSystem instance:



POIFSFileSystem fs = new POIFSFileSystem( new FileInputStream( filename ) );

While the POIFSFileSystem provides access to the underlying document, the specific class that represents a Word document is the org.apache.poi.hwpf.HWPFDocument class. At this point in its development, this class provides access to the underlying components of the Word document, but it does not provide a useful higher level abstraction. To extract the textual contents of the document, POI provides another helper class: org.apache.poi.hwpf.extractor.WordExtractor. This class provide three methods, shown in table 1.

Table 1. WordExtractor methods

Method

Description

String[] getParagraphText()

Retrieves the text from a Word file as an array of paragraph Strings (each String represents a paragraph in from the document)

String getText()

Returns all of the text of the Word document

String getTextFromPieces()

Returns the text from the document pieces, which may contain markup text; works when the text piece to paragraph mapping is broken

A WordExtractor can be created from an HWPFDocument instance as follows:

WordExtractor we = new WordExtractor( doc );

And then we can iterate over the entire Word document, paragraph-by-paragraph by using thegetParagraphText() method. For example:

// Extract all paragraphs in the document as strings
String[] paragraphs = we.getParagraphText();
// Output the document
System.out.println( "Word Document has " + paragraphs.length + " paragraphs" );
for( int i=0; i<paragraphs.length; i++ )
{
      System.out.println( paragraphs[ i ] );
}

Listing 1 puts this all together into a working example that accepts as a command line input a Word document and then displays its contents to the standard output.

Listing 1. WordSampleReader.java

package com.informit.poi;
// Import POI classes
import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.hwpf.*;
import org.apache.poi.hwpf.extractor.*;
// Import Java classes
import java.io.*;
import java.util.*;
public class WordSampleReader
{
  public static void main( String[] args )
  {
   if( args.length == 0 )
   {
     System.out.println( "Usage: WordSampleReader <filename>" );
     System.exit( 0 );
   }
   String filename = args[ 0 ];
   try
   {
     // Create a POI File System object; this is the main class for the POIFS file system
     // and it manages the entire lifecycle of the file system
     POIFSFileSystem fs = new POIFSFileSystem( new FileInputStream( filename ) );
     // Create a document for this file
     HWPFDocument doc = new HWPFDocument( fs );
     // Create a WordExtractor to read the text of the word document
     WordExtractor we = new WordExtractor( doc );
     // Extract all paragraphs in the document as strings
     String[] paragraphs = we.getParagraphText();
     // Output the document
     System.out.println( "Word Document has " + paragraphs.length + " paragraphs" );
     for( int i=0; i<paragraphs.length; i++ )
     {
      System.out.println( paragraphs[ i ] );
     }
   }
   catch( Exception e )
   {
     e.printStackTrace();
   }
  }
}

Listing 1 opens the specified file by creating a FileInputStream to it and building a POIFSFileSystemobject. It then constructs an HWPFDocument object from the POIFSFileSystem and a WordExtractor from the HWPFDocument. It extracts all paragraphs from the WordExtractor and then iteratively displays each paragraph to the standard output.

The output is free of all markup, such as document style, underlining, bolding, and so forth. I tested this program by parsing my Java Reference Guide update (in its raw Word format) from last week (the POI PowerPoint update) and it displayed the following output, abbreviated:

Word Document has 140 paragraphs
(c)Jakarta POI
(d)Microsoft PowerPoint Documents
Having taught dozens of classes and given hundreds of presentations, Microsoft PowerPoint has become one of my trusty companions. (A quick search of my local documents revealed over 300 PowerPoint presentations.) And while I might not feel the need to compose a slide presentation without Microsoft PowerPoint, the oppor
tunity to read my PowerPoint presentations and search for content from within a Java application is very appealing.
...

The only "noise" that it displayed at the end of the document included the template name as well as an internal date stamp:

MACGLOBL.DOT
? DATE \l ¶6/3/2007§? TIME ¶9:35 PM§
MACGLOBL.DOT
? DATE \l ¶6/3/2007§? TIME ¶9:35 PM§

While not perfect, it does provide you with access to read and search the contents of a Word document.

Summary

Microsoft maintains a collection of proprietary file formats that make interoperability with Java a difficult task. The Jakarta POI project bridges that gap by providing a pure Java solution to reading from and writing to these proprietary file formats.

Thus far in this series we have explored how POI can be used to read Microsoft Excel, PowerPoint, and Word files. As a project for you to embark on if you want more experience using POI, try the following suggestion: build a document search engine. Accept as input a directory to search and then iteratively open each document. If the document has a ".doc" extension then open it with the Word reader, an ".xls" extension then open it with the Excel reader, or a ".ppt" extension then open it with the PowerPoint reader. For a simple search you can load the entire document into a String and then search for a specific substring using the indexOf() method. And for the truly adventurous, you can incorporate the Jakarta Regular expression library and go to town.

 

 


 


 

POI HSLF

Java API To Access Microsoft Powerpoint Format Files

 

 

1. POI API Document

http://jakarta.apache.org/poi/apidocs/index.html

 

2. POI Download

http://www.apache.org/dyn/closer.cgi/jakarta/poi/

 

3. 간단한 텍스트 읽기

<%@ page import="java.io.*"%>
<%@ page import="org.apache.poi.hslf.model.*"%>
<%@ page import="org.apache.poi.hslf.usermodel.*"%>
<%@ page import="org.apache.poi.hslf.*"%>

<%

    SlideShow src =

          new SlideShow(

          new HSLFSlideShow("C:\\Web\\Tomcat 5.5\\webapps\\ROOT\\test.ppt"));


    Slide[] sl = src.getSlides();


    for (int i = 0; i < sl.length; i++) {
        Slide s = sl[i];
        TextRun[] trs = s.getTextRuns();
        for (int k = 0; k < trs.length; k++) {
            TextRun tr = trs[k];
            System.out.println(tr.getText());
           }
    }

%>

 

4. 테스트버젼

PPT 2000

 


 

 


JAKARTA_POI.doc
0.26MB