Reading special characters from a .txt file in Python

Question:

The goal of this code is to find the frequency of words used in a book.

I am trying to read in the text of a book, but the following line keeps throwing my code off:

precious protégés. No, gentlemen; he'll always show 'em a clean pair

Specifically, the é character.

I have looked at the following documentation, but I don't quite understand it: https://docs.python.org/3.4/howto/unicode.html

Here is my code:

import string
# Create word dictionary from the comprehensive word list 
word_dict = {}
def create_word_dict ():

  # open words.txt and populate dictionary
  word_file = open ("./words.txt", "r")
  for line in word_file:
    line = line.strip()
    word_dict[line] = 1

# Removes punctuation marks from a string
def parseString (st):
  st = st.encode("ascii", "replace")
  new_line = ""
  st = st.strip()
  for ch in st:
    ch = str(ch)
    if (n for n in (1,2,3,4,5,6,7,8,9,0)) in ch or ' ' in ch or ch.isspace() or ch == u'\xe9':

      print (ch)
      new_line += ch
    else:
      new_line += ""
  # now remove all instances of 's or ' at end of line
  new_line = new_line.strip()
  print (new_line)
  if (new_line[-1] == "'"):
    new_line = new_line[:-1]
  new_line.replace("'s", "")
  # Conversion from ASCII codes back to useable text
  message = new_line
  decodedMessage = ""
  for item in message.split():
    decodedMessage += chr(int(item))
  print (decodedMessage)
  return new_line

# Returns a dictionary of words and their frequencies
def getWordFreq (file):

  # Open file for reading the book.txt
  book = open (file, "r")

  # create an empty set for all Capitalized words
  cap_words = set()

  # create a dictionary for words
  book_dict = {}
  total_words = 0

  # remove all punctuation marks other than '[not s]
  for line in book:
    line = line.strip()
    if (len(line) > 0):
      line = parseString (line)

    word_list = line.split()

    # add words to the book dictionary
    for word in word_list:
      total_words += 1
      if (word in book_dict):
        book_dict[word] = book_dict[word] + 1
      else:
        book_dict[word] = 1
  print (book_dict)

  # close the file
  book.close()

def main():
  wordFreq1 = getWordFreq ("./Tale.txt")
  print (wordFreq1)

main()

The error I receive is as follows:

Traceback (most recent call last):
  File "Books.py", line 80, in <module>
    main()
  File "Books.py", line 77, in main
    wordFreq1 = getWordFreq ("./Tale.txt")
  File "Books.py", line 60, in getWordFreq
    line = parseString (line)
  File "Books.py", line 36, in parseString
    decodedMessage += chr(int(item))
OverflowError: Python int too large to convert to C long

Answer:

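When you open a text file in Python without an encoding argument, it uses the platform's default encoding (an ANSI code page on Windows). That may not match the encoding the file was saved in, so your é character is not decoded correctly. Try passing the encoding explicitly: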

word_file = open ("./words.txt", "r", encoding='utf-8')
