IT Knowledge Base

~ Without sacrifice, there can be no victory ~

Published:

How to Save the Contents of PDF Files as Text Files in Microsoft Word

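01. Today I ran into a problem: I had a large batch of PDF files (close to 50) and needed to pull out their contents for further analysis. The good news is that the content of these PDF files is copyable text, not images. The bad news is that I have no PDF editing software such as Adobe Acrobat on hand.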

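02. After searching online, I found that Word can open a PDF file and edit it. The approach I came up with is to save the content of each PDF file as a text file, and then work out how to extract the parts I need.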
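The idea can be tried on a single file first. The following is only a minimal sketch: the file name sample.pdf is a hypothetical placeholder, and it assumes the document Word produces from the PDF can be saved straight to text without going through the clipboard.

```vba
Sub single_pdf_to_textfile()
    ' Hypothetical sample file; replace it with a real PDF path.
    Const pdf_path As String = "C:\temp\PDF folder\sample.pdf"

    Dim pdf_document As Document

    ' Suppress warning messages while Word converts and saves the file.
    Application.DisplayAlerts = wdAlertsNone

    ' Word converts the PDF into an editable document when it opens it.
    Set pdf_document = Documents.Open(FileName:=pdf_path, ConfirmConversions:=False, AddToRecentFiles:=False)

    ' Save the converted content as UTF-8 text (code page 65001) next to the PDF.
    pdf_document.SaveAs2 FileName:=pdf_path & ".txt", FileFormat:=wdFormatText, Encoding:=65001

    ' Close the document without saving anything else.
    pdf_document.Close SaveChanges:=wdDoNotSaveChanges

    ' Restore the default alert behaviour.
    Application.DisplayAlerts = wdAlertsAll
End Sub
```

Run it from the VBA editor in Word (Alt+F11). If it works, sample.pdf.txt should appear in the same folder as the PDF.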

03. The next step is to use Word VBA to get the result I want.

Sub pdf_to_textfile()
    ' Suppress warning messages while Word converts and saves the files.
    Application.DisplayAlerts = wdAlertsNone

    ' Define working folder path and file name.
    Dim folder_path As String
    Dim file_name As String

    ' Define the opened PDF and the new document used for the text file.
    Dim pdf_document As Document
    Dim new_document As Document

    ' Assume all PDF files are stored in this folder.
    folder_path = "C:\temp\PDF folder\"

    ' Get the name of the first PDF file in the folder.
    file_name = Dir(folder_path & "*.pdf")
    Do While file_name <> ""

        ' Open the PDF file by its full path; Word converts it to an editable document.
        Set pdf_document = Documents.Open(FileName:=folder_path & file_name, ConfirmConversions:=False, ReadOnly:=False, AddToRecentFiles:=False, Format:=wdOpenFormatAuto)

        ' Copy all content of the converted PDF to the clipboard.
        pdf_document.Content.Copy

        ' Create a new document in Word.
        Set new_document = Documents.Add

        ' Paste the clipboard into the new document.
        new_document.Content.Paste

        ' Save the new document as a text file with encoding 65001 (UTF-8).
        new_document.SaveAs2 FileName:=folder_path & file_name & ".txt", FileFormat:=wdFormatText, Encoding:=65001, AddToRecentFiles:=True, InsertLineBreaks:=False, AllowSubstitutions:=False, LineEnding:=wdCRLF

        ' Close only the two documents opened here, not every open document.
        new_document.Close SaveChanges:=wdDoNotSaveChanges
        pdf_document.Close SaveChanges:=wdDoNotSaveChanges

        ' Get the next PDF file name.
        file_name = Dir()
    Loop

    ' Restore the default alert behaviour.
    Application.DisplayAlerts = wdAlertsAll
End Sub
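With the text files in place, the remaining task is extracting the needed content. What to extract depends on the PDFs, so the following is only a rough sketch under assumptions: the keyword "Invoice" and the helper names lines_containing, read_utf8_file and extract_from_textfiles are hypothetical. It reads each UTF-8 text file through a late-bound ADODB.Stream and prints the lines that contain the keyword to the Immediate window.

```vba
' Return every line of text_content that contains keyword, each followed by CRLF.
Function lines_containing(ByVal text_content As String, ByVal keyword As String) As String
    Dim all_lines() As String
    Dim result As String
    Dim i As Long

    ' The text files were saved with CRLF line endings.
    all_lines = Split(text_content, vbCrLf)
    For i = LBound(all_lines) To UBound(all_lines)
        ' Case-insensitive search for the keyword within the line.
        If InStr(1, all_lines(i), keyword, vbTextCompare) > 0 Then
            result = result & all_lines(i) & vbCrLf
        End If
    Next i
    lines_containing = result
End Function

' Read a whole UTF-8 text file into a string.
Function read_utf8_file(ByVal file_path As String) As String
    Dim utf8_stream As Object
    Set utf8_stream = CreateObject("ADODB.Stream")
    utf8_stream.Type = 2              ' 2 = adTypeText
    utf8_stream.Charset = "utf-8"
    utf8_stream.Open
    utf8_stream.LoadFromFile file_path
    read_utf8_file = utf8_stream.ReadText
    utf8_stream.Close
End Function

Sub extract_from_textfiles()
    Dim folder_path As String
    Dim file_name As String

    ' Same folder as the macro above; the text files sit next to the PDFs.
    folder_path = "C:\temp\PDF folder\"

    file_name = Dir(folder_path & "*.txt")
    Do While file_name <> ""
        ' "Invoice" is a hypothetical keyword; replace it with the text to look for.
        Debug.Print file_name
        Debug.Print lines_containing(read_utf8_file(folder_path & file_name), "Invoice")
        file_name = Dir()
    Loop
End Sub
```

Keeping the line filter in its own function makes it easy to swap in a different rule later, for example a pattern match instead of a plain keyword search.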
