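PDFBox is very unfriendly to Chinese text. If you have recently needed to insert text into PDFs, I suggest using iText instead. If the PDFs you work with contain no Chinese, or you only need to insert images, delete pages and the like, then read on~
Preface:
A while ago, alongside the tasks my company assigned me, I used my spare time to build a feature for manipulating PDFs in Java. At first I had no idea where to start, until I found PDFBox online. PDFBox is a free, open-source PDF manipulation library from Apache, and it is fairly convenient to use. It can be downloaded from GitHub; I have also uploaded a copy: [ pdfbox-1.8.9.zip ]
1. First, import the jar
I import it via Maven. PS: this jar bundles all of the PDFBox tool classes, so importing this one is enough. (When I was looking for tool classes, I saw other bloggers importing many separate PDFBox jars, so I imported them all as well and ended up with jar conflicts. It turns out you only need one: the bundle the project has already assembled.)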
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-app</artifactId>
    <version>1.8.10</version>
</dependency>
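2. Create a utility class in your project
2.1 Name the class whatever you like; I called mine pdfUtil
2.2 Of course, if you want to record the operations in a database, you can also create a PDF entity class. Whether you create one is up to you; here are the fields of my entity class, for reference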
private Integer id;
private Date time;
private String filename;
private String filesize;
private String filetype;
private String details;
private String content;
private String outputfile;
private String inputfile;
private String strtofind;
private String message;
private String imagefile;
private String imagelist;
private Integer pageno;
private Integer pages;
private Integer rid;
private Integer pageoperation;
private Integer pagestart;
private Integer pageend;
private String position;
private String fileSizeAfter;
private Integer status;
private Integer afterPages;
private Integer imgSize;
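3. Write the code in pdfUtil
PS: wherever the code below uses the pdfDomainVO entity class, refer to the fields listed above
Inside the pdfbox-1.8.9.zip folder you can find an examples folder
It contains many examples, such as:
1 Creating a PDF file
2 Reading all the text in a PDF (can be received as a String)
3 Replacing characters in a PDF (I still have not solved this for Chinese, sorry)
4 Inserting an image into a PDF
and so on...
PS: here is my code
-----1 Create one or more blank pages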
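// imports: java.io.IOException, org.apache.pdfbox.exceptions.COSVisitorException,
//          org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.pdmodel.PDPage
/**
 * Create one or more blank pages.
 * @param outputFile path of the PDF file to write
 * @throws IOException
 * @throws COSVisitorException
 */
public static void createBlank(String outputFile) throws IOException, COSVisitorException {
    PDDocument document = null;
    try {
        document = new PDDocument();
        // Three blank pages here; add more or fewer as needed
        PDPage blankPage = new PDPage();
        PDPage blankPage1 = new PDPage();
        PDPage blankPage2 = new PDPage();
        document.addPage(blankPage);
        document.addPage(blankPage1);
        document.addPage(blankPage2);
        document.save(outputFile);
        System.out.println("over");
    } finally {
        // Always release the document, even if saving fails
        if (document != null) {
            document.close();
        }
    }
}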
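-----2 Read the text of a PDF (all pages)
/**
 * Read the text of a PDF (all pages).
 */
public static void READPDF(String inputFile) {
    PDDocument doc = null;
    try {
        doc = PDDocument.load(new File(inputFile));
        // In PDFBox 1.8 the stripper can be given an encoding name
        PDFTextStripper textStripper = new PDFTextStripper("GBK");
        String content = textStripper.getText(doc);
        System.out.println("Content: " + content);
        System.out.println("Total pages: " + doc.getNumberOfPages());
    } catch (Exception e) {
        // Do not swallow the error silently
        e.printStackTrace();
    } finally {
        if (doc != null) {
            try {
                doc.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}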
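-----3 Read the text of a PDF (a given page)
/**
 * Read the text of a single page of a PDF.
 * Note that PDFTextStripper counts pages starting from 1.
 */
public static pdfDomainVO readPageNO(pdfDomainVO vo) {
    PDDocument document = null;
    try {
        document = PDDocument.load(vo.getInputfile());
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition(true);
        // Start and end on the same page so only that page is extracted
        stripper.setStartPage(vo.getPageno());
        stripper.setEndPage(vo.getPageno());
        String content = stripper.getText(document);
        vo.setContent(content);
        System.out.println("function : readPageNO over");
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // The original never closed the document; release it here
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return vo;
}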
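One detail that is easy to trip over: PDFTextStripper's setStartPage and setEndPage count pages from 1, while the list returned by getAllPages() is indexed from 0. The sketch below is only an illustration of converting between the two; the class and method names (PageNumbers, toStripperPage) are hypothetical and not part of PDFBox or of my pdfUtil.

```java
class PageNumbers {
    /**
     * Converts a 0-based page index (as used with getAllPages().get(i))
     * into the 1-based page number that PDFTextStripper expects.
     * Hypothetical helper, not a PDFBox API.
     */
    static int toStripperPage(int zeroBasedIndex, int pageCount) {
        if (zeroBasedIndex < 0 || zeroBasedIndex >= pageCount) {
            throw new IllegalArgumentException(
                "page index " + zeroBasedIndex + " is outside 0.." + (pageCount - 1));
        }
        return zeroBasedIndex + 1;
    }

    public static void main(String[] args) {
        // First page of a three-page document
        System.out.println(toStripperPage(0, 3));
    }
}
```

With a helper like this, the same stored page index could be passed to both the stripper and the page list without an off-by-one surprise.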
-----4 Replace text in a given PDF file (this one is more involved; I spent a long time reading the API and then added the comments one by one)
public static pdfDomainVO replaceContent(pdfDomainVO vo) throws IOException, COSVisitorException {
    PDDocument doc = null;
    try {
        doc = PDDocument.load(vo.getInputfile());
        List pages = doc.getDocumentCatalog().getAllPages();
        // The page list is indexed from 0
        PDPage page = (PDPage) pages.get(vo.getPageno());
        // Parse the page's content stream into tokens (operands and operators)
        PDStream contents = page.getContents();
        PDFStreamParser parser = new PDFStreamParser(contents.getStream());
        parser.parse();
        List tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof PDFOperator) {
                PDFOperator op = (PDFOperator) next;
                if (op.getOperation().equals("Tj")) {
                    // Tj shows one string; its operand is the token just before it
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    // Note: replaceFirst treats the search string as a regular expression
                    string = string.replaceFirst(vo.getStrtofind(), vo.getMessage());
                    System.out.println(string);
                    previous.reset();
                    previous.append(string.getBytes("GBK"));
                } else if (op.getOperation().equals("TJ")) {
                    // TJ shows an array that mixes strings with spacing adjustments
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            string = string.replaceFirst(vo.getStrtofind(), vo.getMessage());
                            cosString.reset();
                            cosString.append(string.getBytes("GBK"));
                        }
                    }
                }
            }
        }
        // Write the modified tokens into a new stream and attach it to the page
        PDStream updatedStream = new PDStream(doc);
        OutputStream out = updatedStream.createOutputStream();
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        page.setContents(updatedStream);
        doc.save(vo.getOutputfile());
        vo.setAfterPages(doc.getNumberOfPages());
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (doc != null) {
            doc.close();
        }
    }
    return vo;
}
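A note on the replaceFirst calls: String.replaceFirst interprets its first argument as a regular expression and gives `$` and `\` a special meaning in the replacement. If the search text can contain characters such as parentheses or dots, a literal replacement is safer. The sketch below uses only the JDK; the class and method names (LiteralReplace, replaceFirstLiteral) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralReplace {
    /**
     * Replaces the first occurrence of target in text, treating both
     * target and replacement as plain text rather than regex syntax.
     */
    static String replaceFirstLiteral(String text, String target, String replacement) {
        return text.replaceFirst(
            Pattern.quote(target),                  // escape regex metacharacters
            Matcher.quoteReplacement(replacement)); // escape '$' and '\'
    }

    public static void main(String[] args) {
        System.out.println(replaceFirstLiteral("Total (USD): 5, (USD)", "(USD)", "$10"));
    }
}
```

Swapping this in for replaceFirst would only change behavior when the search string or the message contains regex special characters.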
-----5 Insert an image into a PDF (on a given page)
/**
 * Insert an image into a PDF.
 * Uses vo.getInputfile(), vo.getImagefile(), vo.getOutputfile(),
 * vo.getPosition() ("x,y"), vo.getPageno() and vo.getImgSize().
 * @throws IOException
 * @throws COSVisitorException
 */
public static pdfDomainVO insertImage(pdfDomainVO vo) throws IOException, COSVisitorException {
    // The position is passed as "x,y"
    String[] position = vo.getPosition().split(",");
    int x = Integer.parseInt(position[0].trim());
    int y = Integer.parseInt(position[position.length - 1].trim());
    PDDocument doc = null;
    try {
        doc = PDDocument.load(vo.getInputfile());
        PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(vo.getPageno());
        PDXObjectImage ximage = null;
        String imageFile = vo.getImagefile();
        String lower = imageFile.toLowerCase();
        if (lower.endsWith(".jpg")) {
            ximage = new PDJpeg(doc, new FileInputStream(imageFile));
        } else if (lower.endsWith(".tif") || lower.endsWith(".tiff")) {
            // RandomAccessFile here is org.apache.pdfbox.io.RandomAccessFile, not java.io
            ximage = new PDCcitt(doc, new RandomAccessFile(new File(imageFile), "r"));
        } else {
            BufferedImage awtImage = ImageIO.read(new File(imageFile));
            ximage = new PDPixelMap(doc, awtImage);
        }
        // append = true, compress = true: keep the existing page content
        PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true);
        // imgSize is in tenths of the original size, so 10 means 100%
        float scale = vo.getImgSize() / 10f;
        System.out.println(ximage.getHeight());
        System.out.println(ximage.getWidth());
        contentStream.drawXObject(ximage, x, y, ximage.getWidth() * scale, ximage.getHeight() * scale);
        contentStream.close();
        doc.save(vo.getOutputfile());
        vo.setAfterPages(doc.getNumberOfPages());
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (doc != null) {
            doc.close();
        }
    }
    return vo;
}
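The position string is trusted as-is in the method above, so a value like "100" or "a,b" either gives a surprising result or fails with a bare NumberFormatException. As a rough sketch (the class name ImagePosition and its parse method are hypothetical, and it is deliberately stricter than my original code), the "x,y" parsing could be pulled out and validated like this:

```java
class ImagePosition {
    /**
     * Parses a position given as "x,y" into {x, y}.
     * Rejects anything that is not exactly two integer coordinates.
     */
    static int[] parse(String position) {
        if (position == null) {
            throw new IllegalArgumentException("position is null");
        }
        String[] parts = position.split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected \"x,y\" but got: " + position);
        }
        try {
            int x = Integer.parseInt(parts[0].trim());
            int y = Integer.parseInt(parts[1].trim());
            return new int[] { x, y };
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("non-numeric coordinate in: " + position, e);
        }
    }

    public static void main(String[] args) {
        int[] xy = parse("100, 200");
        System.out.println(xy[0] + " " + xy[1]);
    }
}
```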
-----6 Convert a given page of a PDF file to an image
/**
 * Convert a given page of a PDF file to an image.
 * If vo.getPageno() is null, every page is converted.
 * vo.getOutputfile() is the output folder here.
 */
public static pdfDomainVO toImage(pdfDomainVO vo) {
    PDDocument doc = null;
    try {
        doc = PDDocument.load(vo.getInputfile());
        List pages = doc.getDocumentCatalog().getAllPages();
        if (vo.getPageno() != null) {
            // Only the requested page (0-based index); the file name uses the 1-based number
            PDPage page = (PDPage) pages.get(vo.getPageno());
            BufferedImage image = page.convertToImage();
            Iterator iter = ImageIO.getImageWritersBySuffix("jpg");
            ImageWriter writer = (ImageWriter) iter.next();
            File outFile = new File(vo.getOutputfile() + vo.getFilename() + "-" + (vo.getPageno() + 1) + ".jpg");
            FileOutputStream out = new FileOutputStream(outFile);
            ImageOutputStream outImage = ImageIO.createImageOutputStream(out);
            writer.setOutput(outImage);
            writer.write(new IIOImage(image, null, null));
            // Flush and release the file handles
            outImage.close();
            out.close();
        } else {
            for (int i = 0; i < pages.size(); i++) {
                PDPage page = (PDPage) pages.get(i);
                BufferedImage image = page.convertToImage();
                Iterator iter = ImageIO.getImageWritersBySuffix("jpg");
                ImageWriter writer = (ImageWriter) iter.next();
                File outFile = new File(vo.getOutputfile() + i + ".jpg");
                FileOutputStream out = new FileOutputStream(outFile);
                ImageOutputStream outImage = ImageIO.createImageOutputStream(out);
                writer.setOutput(outImage);
                writer.write(new IIOImage(image, null, null));
                outImage.close();
                out.close();
            }
        }
        // Read the page count before the document is closed
        vo.setAfterPages(doc.getNumberOfPages());
        System.out.println("over");
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (doc != null) {
            try {
                doc.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return vo;
}
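The JPEG-writing part does not depend on PDFBox at all, so it can be factored out and tried on its own. Below is a minimal sketch using only javax.imageio, with the stream closed by try-with-resources and the writer disposed afterwards. The class and method names (JpegSaver, writeJpeg) are hypothetical, and I am assuming the image has no alpha channel (for example TYPE_INT_RGB), since JPEG does not store transparency.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

class JpegSaver {
    /** Writes image to outFile as JPEG, replacing any existing file. */
    static void writeJpeg(BufferedImage image, File outFile) throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersBySuffix("jpg");
        if (!writers.hasNext()) {
            throw new IOException("no JPEG writer available");
        }
        ImageWriter writer = writers.next();
        // Remove an old file first so no stale bytes are left behind
        Files.deleteIfExists(outFile.toPath());
        try (ImageOutputStream out = ImageIO.createImageOutputStream(outFile)) {
            if (out == null) {
                throw new IOException("cannot open " + outFile);
            }
            writer.setOutput(out);
            writer.write(new IIOImage(image, null, null));
        } finally {
            // Release the writer's native resources
            writer.dispose();
        }
    }

    public static void main(String[] args) throws IOException {
        BufferedImage blank = new BufferedImage(8, 6, BufferedImage.TYPE_INT_RGB);
        File target = File.createTempFile("page", ".jpg");
        writeJpeg(blank, target);
        System.out.println(target.length() > 0);
    }
}
```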
-----7 Insert a line of text on a given page (adjust the font and the text position as you like)
/**
 * Insert a line of text on a given page.
 * Uses vo.getInputfile(), vo.getMessage(), vo.getPageno() and vo.getOutputfile().
 * @throws IOException
 * @throws COSVisitorException
 */
public static pdfDomainVO InsertPageContent(pdfDomainVO vo) throws IOException, COSVisitorException {
    PDDocument doc = null;
    try {
        doc = PDDocument.load(vo.getInputfile());
        List allPages = doc.getDocumentCatalog().getAllPages();
        PDFont font = PDType1Font.HELVETICA_BOLD;
        float fontSize = 36.0f;
        PDPage page = (PDPage) allPages.get(vo.getPageno());
        PDRectangle pageSize = page.findMediaBox();
        // getStringWidth is in 1/1000 of a text unit, so scale by the font size
        float stringWidth = font.getStringWidth(vo.getMessage()) * fontSize / 1000f;
        // A page rotated by 90 or 270 degrees has its width and height swapped
        int rotation = page.findRotation();
        boolean rotate = rotation == 90 || rotation == 270;
        float pageWidth = rotate ? pageSize.getHeight() : pageSize.getWidth();
        float pageHeight = rotate ? pageSize.getWidth() : pageSize.getHeight();
        double centeredXPosition = rotate ? pageHeight / 2f : (pageWidth - stringWidth) / 2f;
        double centeredYPosition = rotate ? (pageWidth - stringWidth) / 2f : pageHeight / 2f;
        // append = true, compress = true, resetContext = true
        PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true, true);
        contentStream.beginText();
        contentStream.setFont(font, fontSize);
        // Red text
        contentStream.setNonStrokingColor(255, 0, 0);
        if (rotate) {
            contentStream.setTextRotation(Math.PI / 2, centeredXPosition, centeredYPosition);
        } else {
            contentStream.setTextTranslation(centeredXPosition, centeredYPosition);
        }
        contentStream.drawString(vo.getMessage());
        contentStream.endText();
        contentStream.close();
        vo.setAfterPages(doc.getNumberOfPages());
        doc.save(vo.getOutputfile());
        System.out.println("over");
    } finally {
        if (doc != null) {
            doc.close();
        }
    }
    return vo;
}
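The centering arithmetic is the part most people will want to tweak, so here it is on its own as plain Java with no PDFBox types. This is only a sketch that mirrors the formulas in the method above; the names (TextCentering, centeredOrigin, glyphWidthUnits) are hypothetical, and glyphWidthUnits stands for whatever font.getStringWidth(...) returned.

```java
class TextCentering {
    /**
     * Returns {x, y} for the text origin so the string is centered.
     * mediaWidth and mediaHeight are the unrotated media box size,
     * rotation is the page rotation in degrees, and glyphWidthUnits
     * is the string width in 1/1000 text units.
     */
    static float[] centeredOrigin(float mediaWidth, float mediaHeight, int rotation,
                                  float glyphWidthUnits, float fontSize) {
        float stringWidth = glyphWidthUnits * fontSize / 1000f;
        boolean rotate = rotation == 90 || rotation == 270;
        // Width and height swap on a page rotated by a quarter turn
        float pageWidth = rotate ? mediaHeight : mediaWidth;
        float pageHeight = rotate ? mediaWidth : mediaHeight;
        float x = rotate ? pageHeight / 2f : (pageWidth - stringWidth) / 2f;
        float y = rotate ? (pageWidth - stringWidth) / 2f : pageHeight / 2f;
        return new float[] { x, y };
    }

    public static void main(String[] args) {
        // US Letter page (612 x 792 points), unrotated, 36 pt font
        float[] origin = centeredOrigin(612f, 792f, 0, 5000f, 36f);
        System.out.println(origin[0] + ", " + origin[1]);
    }
}
```

For the call in main, the string is 5000 * 36 / 1000 = 180 points wide, so x is (612 - 180) / 2 = 216 and y is 792 / 2 = 396.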
-----8 Extract images and save them
/**
 * Extract images and save them.
 * If vo.getPageno() is null, images are extracted from every page.
 * @param vo the pdfDomainVO describing input file, output folder and page
 * @throws IOException
 */
public static pdfDomainVO extractImage(pdfDomainVO vo) throws IOException {
    PDDocument doc = null;
    try {
        doc = PDDocument.load(vo.getInputfile());
        /** Document page information **/
        PDDocumentCatalog catalog = doc.getDocumentCatalog();
        List pages = catalog.getAllPages();
        int pageNum = pages.size();
        PDPage page = null;
        if (vo.getPageno() != null) {
            page = (PDPage) pages.get(vo.getPageno());
            if (null != page) {
                PDResources resource = page.findResources();
                Map<String, PDXObjectImage> imgs = resource.getImages();
                for (Map.Entry<String, PDXObjectImage> me : imgs.entrySet()) {
                    PDXObjectImage img = me.getValue();
                    // Include the resource key so several images on one page do not overwrite each other;
                    // write2file appends the file suffix itself
                    img.write2file(vo.getOutputfile() + vo.getFilename() + "-" + (vo.getPageno() + 1) + "-" + me.getKey());
                }
            }
        } else {
            for (int i = 0; i < pageNum; i++) {
                page = (PDPage) pages.get(i);
                if (null != page) {
                    PDResources resource = page.findResources();
                    Map<String, PDXObjectImage> imgs = resource.getImages();
                    for (Map.Entry<String, PDXObjectImage> me : imgs.entrySet()) {
                        // Page number plus resource key instead of random numbers, which could collide
                        String count = (i + 1) + "-" + me.getKey();
                        PDXObjectImage img = me.getValue();
                        img.write2file(vo.getOutputfile() + count);
                    }
                }
            }
        }
        vo.setAfterPages(doc.getNumberOfPages());
        System.out.println("extractImage:over");
    } finally {
        if (doc != null) {
            doc.close();
        }
    }
    return vo;
}
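When a page holds several images, each one needs its own output name, otherwise a later image can overwrite an earlier one. One option is to build the name from the page number and the image's resource key (the Map key returned by getImages()). The sketch below is plain Java; the names (ImageNames, outputPrefix) are hypothetical, and replacing unusual characters in the key is my own precaution rather than something PDFBox requires.

```java
class ImageNames {
    /**
     * Builds a file name prefix such as "/tmp/out/report-1-Im0".
     * The caller (for example write2file) is expected to add the suffix.
     */
    static String outputPrefix(String outputDir, String baseName,
                               int zeroBasedPage, String resourceKey) {
        // Keep only characters that are safe in file names on common systems
        String safeKey = resourceKey.replaceAll("[^A-Za-z0-9_-]", "_");
        return outputDir + baseName + "-" + (zeroBasedPage + 1) + "-" + safeKey;
    }

    public static void main(String[] args) {
        System.out.println(outputPrefix("/tmp/out/", "report", 0, "Im0"));
    }
}
```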
-----9 Remove a page from a PDF document (the last remaining page cannot be removed!)
/**
 * Remove a page from a PDF document.
 * A PDF document must have at least one page, so the last remaining page cannot be removed!
 * Uses vo.getInputfile(), vo.getPageno() and vo.getOutputfile().
 * @throws Exception
 */
public static pdfDomainVO removePage(pdfDomainVO vo) throws Exception {
    // Details holds my own status constants (class not shown here)
    vo.setStatus(Details.FailStatus);
    PDDocument document = null;
    try {
        document = PDDocument.load(vo.getInputfile());
        if (document.isEncrypted()) {
            throw new IOException("Encrypted documents are not supported for this example");
        }
        if (document.getNumberOfPages() <= 1) {
            throw new IOException("Error: A PDF document must have at least one page, "
                    + "cannot remove the last page!");
        }
        // The page index is 0-based
        document.removePage(vo.getPageno());
        document.save(vo.getOutputfile());
        vo.setAfterPages(document.getNumberOfPages());
        vo.setStatus(Details.SuccessStatus);
        System.out.println("over");
    } finally {
        if (document != null) {
            document.close();
        }
    }
    return vo;
}
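PDFBox is powerful, and above all it is open source (it just doesn't support Chinese, damn it). The above covers only part of what it can do; if you want to go further, see the official examples and API docs
PS: Regrettably, I never solved the garbled Chinese text when replacing or inserting text. If anyone works it out, please let me know so we can all improve together
Here is an article: http://blog.csdn.net/undergrowth/article/details/39136673 that explains PDFBox's methods and properties quite well; it is worth a look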
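I have not confirmed the root cause, but one likely contributor is visible with plain Java. My code writes string.getBytes("GBK") into the content stream, and a PDF viewer interprets those bytes through the current font's encoding. A standard font such as Helvetica has no Chinese glyphs and reads one byte per character. The sketch below (the class name GbkBytesDemo is hypothetical, and it assumes the JRE ships the GBK charset) shows what two Chinese characters turn into when their GBK bytes are read one byte at a time:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GbkBytesDemo {
    /** Hex dump of a byte array, two uppercase digits per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes s as GBK, then reads the bytes back as single-byte Latin-1 text. */
    static String asSingleByteText(String s) {
        byte[] gbk = s.getBytes(Charset.forName("GBK"));
        return new String(gbk, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D\u6587"; // the two characters of the word "Chinese"
        System.out.println(hex(chinese.getBytes(Charset.forName("GBK")))); // D6D0CEC4
        System.out.println(asSingleByteText(chinese).length());           // 4
    }
}
```

Two characters become four unrelated single-byte characters, which matches the kind of garbling I saw. If this is indeed the cause, the fix would have to involve a font that actually contains Chinese glyphs rather than a different byte encoding alone.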