2016-12-14

python regex txt

I have a file, 1.txt:

I. Introduction to Text Mining 1 
I.1 Defining Text Mining 1 
I.2 General Architecture of Text Mining Systems 13 

II. Core Text Mining Operations 19 
II.1 Core Text Mining Operations 19 
II.2 Using Background Knowledge for Text Mining 41 
II.3 Text Mining Query Languages 51 

III. Text Mining Preprocessing Techniques 57 
III.1 Task-Oriented Approaches 58 
III.2 Further Reading 62 

IV. Categorization 64 
IV.1 Applications of Text Categorization 65 
IV.2 Definition of the Problem 66 
IV.3 Document Representation 68 

I want to get a result like this:

I. Introduction to Text Mining 1.1 
    I.1 Defining Text Mining 1.1 
    I.2 General Architecture of Text Mining Systems 13.1 

II. Core Text Mining Operations 19.1 
    II.1 Core Text Mining Operations 19.1 
    II.2 Using Background Knowledge for Text Mining 41.1 
    II.3 Text Mining Query Languages 51.1 
... 

Two changes:

1. Indent the subsection lines (I.1, I.2, ...) with a TAB.
2. Append .1 to the number at the end of every line.

I tried using pandas, but it didn't work. I then tried another approach, but I don't know how to write the rest of the program:

# -*- coding: utf-8 -*-
import re

page_list = []  # page numbers found at the end of a line
content = []    # subsection labels such as "I.1"
with open("D:/Downloads/1.txt") as f:
    for line in f:
        page = re.search(r'(\d+)$', line)
        if page is not None:
            page_list.append(page.group())
        label = re.search(r'^(.*\.\d+)', line)
        if label is not None:
            content.append(label.group())
pages = [x + '.1' for x in page_list]
print(pages)
con = ['\t' + x for x in content]
print(con)

The program's output:

['1.1', '1.1', '13.1', '19.1', '19.1', '41.1', '51.1'] 
['\tI.1', '\tI.2', '\tII.1', '\tII.2', '\tII.3'] 

If you give me a little time, I'll write the code for you. Could you please paste your entire input file, or give the upper bound of your Roman numerals? – user902384


@Shiv Yes. I've edited the post; the first code block is 1.txt. Thanks! – pang2016

Answers


You can try this:

(.*)(\d+) 

And replace with:

\1\2.1 

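Explanation: the greedy (.*) captures everything up to the last digit on the line, (\d+) captures that final digit, and the replacement writes both groups back followed by .1.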
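
To see how the two groups split a line, here is a minimal sketch that inspects the match on one sample line from the question. The variable names `line` and `m` are only illustrative.

```python
import re

line = "I.2 General Architecture of Text Mining Systems 13"
m = re.match(r"(.*)(\d+)", line)

# The greedy (.*) gives back only one character, so (\d+) holds just the last digit.
print(m.groups())

# Writing both groups back and appending ".1" reproduces the replacement step.
print(m.expand(r"\1\2.1"))
```

Because both groups are written back unchanged, it does not matter that the page number is split between them.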

Sample code:

import re 

regex = r"(.*)(\d+)" 

test_str = ("I. Introduction to Text Mining 1\n" 
    "I.1 Defining Text Mining 1\n" 
    "I.2 General Architecture of Text Mining Systems 13\n\n" 
    "II. Core Text Mining Operations 19\n" 
    "II.1 Core Text Mining Operations 19\n" 
    "II.2 Using Background Knowledge for Text Mining 41\n" 
    "II.3 Text Mining Query Languages 51\n\n" 
    "III. Text Mining Preprocessing Techniques 57\n" 
    "III.1 Task-Oriented Approaches 58\n" 
    "III.2 Further Reading 62\n\n" 
    "IV. Categorization 64\n" 
    "IV.1 Applications of Text Categorization 65\n" 
    "IV.2 Definition of the Problem 66\n" 
    "IV.3 Document Representation 68\n\n") 

subst = "\\1\\2.1" 


result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)

Sample output:

I. Introduction to Text Mining 1.1 
I.1 Defining Text Mining 1.1 
I.2 General Architecture of Text Mining Systems 13.1 

II. Core Text Mining Operations 19.1 
II.1 Core Text Mining Operations 19.1 
II.2 Using Background Knowledge for Text Mining 41.1 
II.3 Text Mining Query Languages 51.1 

III. Text Mining Preprocessing Techniques 57.1 
III.1 Task-Oriented Approaches 58.1 
III.2 Further Reading 62.1 

IV. Categorization 64.1 
IV.1 Applications of Text Categorization 65.1 
IV.2 Definition of the Problem 66.1 
IV.3 Document Representation 68.1 

What should I do to get the TAB indentation at the start of I.1, I.2, III.1, III.2, ..., IV.3? – pang2016


Please look at your sample output, and edit it if it needs changing. –


For the TAB indentation, you can use this code:

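# Indent subsection lines (e.g. "I.1", "IV.3") in a single pass.
# Anchoring with ^ under re.MULTILINE avoids reusing matched text as a regex pattern,
# where characters such as "." would be treated as metacharacters.
result = re.sub(r'^([IVX]+\.\d)', r'\t\1', result, flags=re.MULTILINE)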
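
Both changes from the question can also be combined in one line-by-line pass. The following is only a sketch under assumptions: the function name `reformat_toc` is hypothetical, a subsection line is assumed to start with a Roman numeral, a dot, and a digit, and the page number is assumed to be the last thing on the line once trailing whitespace is stripped.

```python
import re


def reformat_toc(text):
    """Append '.1' to trailing page numbers and tab-indent subsection lines."""
    out = []
    for line in text.splitlines():
        line = line.rstrip()
        # Append ".1" to the page number at the end of the line, if there is one.
        line = re.sub(r"(\d+)$", r"\1.1", line)
        # Assumption: subsection labels look like "I.1" or "IV.3".
        if re.match(r"[IVXLCDM]+\.\d", line):
            line = "\t" + line
        out.append(line)
    return "\n".join(out)


sample = (
    "I. Introduction to Text Mining 1\n"
    "I.1 Defining Text Mining 1\n"
    "\n"
    "II. Core Text Mining Operations 19\n"
    "II.1 Core Text Mining Operations 19"
)
print(reformat_toc(sample))
```

To apply it to 1.txt, read the file's contents into a string and pass that string to the function. Anchoring the page-number pattern at the end of the line keeps the digits inside labels such as I.1 untouched.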