In Python there is a module named re, short for regular expression. I used it in this project because the content of each lesson takes up only about 10 to 40 lines of the code, while the whole source file is about 700 lines. So I needed to extract that content into a normal article that is easy to read and easy to translate.
First, write a regular expression that matches the content I want:
import re
p = re.compile(r'(?<= text=").+(?=" x=")')
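To see what this pattern picks out, here is a small sketch. The sample line is hypothetical, since the real source file is not shown here; I only assume that the content sits between ` text="` and `" x="`.

```python
import re

# Hypothetical source line; the real file format is an assumption.
sample = '<label text="Hello, world" x="10" y="20"/>'

# The lookbehind (?<= text=") and the lookahead (?=" x=") must be present
# around the match, but they are not included in the matched text.
p = re.compile(r'(?<= text=").+(?=" x=")')

matches = p.findall(sample)
print(matches)
```

Only the text between the two markers comes back, without the quotes or attribute names.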
Then read the file and match it line by line:
for line in fin:
    # findall returns a list of every match found in this line
    s_match = re.findall(p, line)
And write each match to a new file:
    # still inside the for loop: skip lines with no match
    if s_match:
        fout.write(s_match[0] + " ")
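Put together, the three steps could look like the sketch below. This assumes Python 3 and UTF-8 files; the function name extract_text and its parameters are made up for illustration and are not part of the original script.

```python
import re

p = re.compile(r'(?<= text=").+(?=" x=")')

def extract_text(src_path, dst_path, encoding="utf-8"):
    """Write the first match of every line in src_path to dst_path."""
    # The encoding is an assumption; change it if the source file differs.
    with open(src_path, encoding=encoding) as fin, \
         open(dst_path, "w", encoding=encoding) as fout:
        for line in fin:
            s_match = p.findall(line)
            if s_match:
                # Join the pieces with a space, as in the steps above.
                fout.write(s_match[0] + " ")
```

Opening both files in a with block closes them even if something goes wrong halfway through.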
The new file (fout) now holds the whole content of the lesson that I need to translate. Whether the content is English or Chinese, it displays properly.
re.findall returns a list of strings, so to use a string, index into the list. If there is more than one string and you want to read them one by one, use for … in s_match to go through all of them.
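Here is a small sketch of reading several matches one by one. The sample line is hypothetical. Note that I switched to the non-greedy .+? here, which is a change from the pattern above: with the greedy .+ the two values on one line would be swallowed into a single long match.

```python
import re

# Hypothetical line with two text attributes.
line = '<a text="first" x="1"/><a text="second" x="2"/>'

# .+? stops at the nearest closing marker instead of the last one.
p_lazy = re.compile(r'(?<= text=").+?(?=" x=")')

s_match = p_lazy.findall(line)
for s in s_match:
    print(s)
```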
Remember: for languages other than English, the strings in the list may be stored in an encoding other than ASCII. To get the string, do not use str(s_match); otherwise you get the escaped bytes of the characters, such as: \xc4\x87\x56\x82.......
Read the actual string elements from the list; do not just convert the whole list to a string.
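The difference can be sketched as follows. This assumes Python 3, where undecoded text shows up as bytes, so I match on bytes to imitate the situation; the sample line and the UTF-8 encoding are assumptions too.

```python
import re

# Hypothetical UTF-8 encoded line, kept as bytes (not yet decoded).
raw = '<a text="中文" x="1"/>'.encode("utf-8")
s_match = re.findall(rb'(?<= text=").+(?=" x=")', raw)

# Converting the whole list gives its printed form, escapes included.
as_list_text = str(s_match)

# Taking the element and decoding it gives the real characters.
as_real_text = s_match[0].decode("utf-8")

print(as_list_text)
print(as_real_text)
```

The first print shows brackets, quotes and \x escapes; the second shows the text itself.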