Regular Expressions

Hhhh8 2018. 11. 9. 10:11



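Regular Expressions (regex)

Purpose: cleaning data and general preprocessing during the data preprocessing stage
Used when processing complex strings
Applications: removing Hangul, Hanja, digits, Latin letters, whitespace, and so on
Lookahead (positive/negative)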

Running a regex

Create a pattern object with re.compile(), or call the module-level function directly
Method 1
p = re.compile(pattern)
m = p.match(string)
Method 2
m = re.match(pattern, string)
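
A minimal sketch of the two methods side by side. The sample text and the date pattern are made up for illustration; both calls should produce an equivalent match object.

```python
import re

text = "2018-11-09 regex notes"

# Method 1: compile once, then reuse the pattern object.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
m1 = date_pattern.match(text)

# Method 2: pass the pattern string straight to the module-level function.
m2 = re.match(r"\d{4}-\d{2}-\d{2}", text)

print(m1.group(), m2.group())
```

Compiling first tends to read better when the same pattern is applied to many strings; for a one-off check the module-level call is shorter.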

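Methods

method	description	return value
match()	checks whether the regex matches at the start of the string; finds at most one match	a match object on success, None otherwise
search()	scans the whole string for the first position where the regex matches; finds at most one match	a match object on success, None otherwise
findall()	finds every non-overlapping substring that matches the regex	list
finditer()	finds every non-overlapping substring that matches the regex	iterator of match objects
sub()	replaces (or removes) the parts that match the regex with another string	the new string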
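
The differences in the table can be sketched on a single input. The sample string and variable names below are illustrative assumptions, not part of the original notebook.

```python
import re

p = re.compile(r"\d+")
text = "room 101, floor 3, building 7"

at_start = p.match(text)                          # None: the string does not start with a digit
first = p.search(text)                            # first match anywhere in the string
all_strings = p.findall(text)                     # list of matched substrings
all_spans = [m.span() for m in p.finditer(text)]  # match objects also carry positions
replaced = p.sub("#", text)                       # new string with every match replaced

print(at_start, first.group(), all_strings, all_spans, replaced, sep="\n")
```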

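Escape sequences

Shorthand notation for frequently used patterns

escape sequence	description
\d	digit; = [0-9] for ASCII text
\D	non-digit; = [^0-9]
\s	whitespace; = [ \t\n\r\f\v]
\S	non-whitespace; = [^ \t\n\r\f\v]
\w	word character; = [a-zA-Z0-9_], and for str patterns in Python 3 also Unicode letters such as Hangul
\W	non-word character; = [^a-zA-Z0-9_]
\A	matches only at the start of the string
\Z	matches only at the end of the string
\b	word boundary
r'...'	raw string prefix (not a regex escape): keeps backslashes literal, so r'\b' is a word boundary rather than a backspace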
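
A short check of two points from the table that are easy to get wrong: what \w actually covers, and why the raw-string prefix matters. This is only a sketch; the variable names are illustrative.

```python
import re

underscore = bool(re.match(r"\w", "_"))              # True: "_" counts as a word character
hangul = bool(re.match(r"\w", "가"))                  # True for str patterns in Python 3
hangul_ascii = bool(re.match(r"\w", "가", re.ASCII))  # False: re.ASCII narrows \w to [a-zA-Z0-9_]
comma_is_space = bool(re.match(r"\s", ","))          # False: a comma is not whitespace

# "\b" in a normal string is one backspace character; r"\b" is a backslash followed by b.
lengths = (len("\b"), len(r"\b"))

print(underscore, hangul, hangul_ascii, comma_is_space, lengths)
```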

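Metacharacters

Characters used for a special purpose rather than for their literal meaning
※ Note: inside a character class, [.] means a literal period, not "any character"

metacharacter	meaning
.	any character except \n
^	start of the string; as the first character inside a character class, ^ negates the class
$	end of the string
*	repeat the preceding expression 0 or more times
+	repeat the preceding expression 1 or more times
{}	repetition: {m,n} = m to n times, {2} = exactly 2 times, {1,} = 1 or more times, {,3} = at most 3 times
?	{0,1} = 0 or 1 time, i.e. optional
()	grouping
[]	character class
|	alternation (or)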

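Compile flags

flag	description
DOTALL (S)	. also matches \n
MULTILINE (M)	^ and $ match at the start and end of every line, not only of the whole string
VERBOSE (X)	allows whitespace and comments inside the regex
IGNORECASE (I)	case-insensitive matching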
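
The four flags can be sketched as follows. The sample text and the phone pattern are assumptions chosen to make each flag's effect visible.

```python
import re

text = "Busan\nbusan city"

plain = re.findall(r"^busan", text)                                  # []
ignore_case = re.findall(r"^busan", text, re.IGNORECASE)             # ['Busan']
multi = re.findall(r"^busan", text, re.IGNORECASE | re.MULTILINE)    # ['Busan', 'busan']

dot_default = re.search(r"Busan.busan", text)                        # None: . stops at \n
dot_all = re.search(r"Busan.busan", text, re.DOTALL)                 # matches across the newline

# VERBOSE: whitespace in the pattern is ignored and # starts a comment.
phone = re.compile(r"""
    \d{3}     # carrier prefix
    -
    \d{4}     # middle block
    -
    \d{4}     # last block
""", re.VERBOSE)
number = phone.search("kim 010-2222-9000").group()

print(plain, ignore_case, multi, dot_default, dot_all.group() == text[:11], number)
```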

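Lookahead (positive/negative)

	syntax	description
positive	(?=...)	succeeds only if the enclosed regex matches at this position
negative	(?!...)	succeeds only if the enclosed regex does not match at this position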


import re


if re.match('[a.b]', 'a'): print('1 match')
if re.match('[a.b]', 'ab'): print('2 match')
if re.match('[a.b]', 'a1b'): print('3 match')
if re.match('[a.b]', 'abc'): print('4 match')
if re.match('[a.b]', 'bca'): print('5 match')
if re.match('[a.b]', 'cab'): print('6 match')
if re.match('a.b', 'abc'): print('7 match')
if re.match('a.b', 'a1b'): print('8 match')
# [a.b] matches when the string starts with any one character listed inside []; '.' inside [] is just a literal period


1 match
2 match
3 match
4 match
5 match
8 match


# Korean sample text; two of its lines start with '애슐리'
data  = '''애슐리 하면 부산대역점 
이라고 생각하겟지만, 서면역에 있는 
애슐리 점이 더 맛있다.
'''
# Without MULTILINE, ^ matches only at the start of the whole string
p     = re.compile(r'^애슐리\s\w+')
print( p.findall(data) )
# With MULTILINE, ^ also matches at the start of every line
p     = re.compile(r'^애슐리\s\w+', re.MULTILINE)
print( p.findall(data) )
# Inside a character class, a leading ^ negates the class
if re.match('[^0-9]', '1'): print('1 match')
if re.match('[^0-9]', 'a'): print('2 match')


['애슐리 하면']
['애슐리 하면', '애슐리 점이']
2 match


if re.match('^[0-9]+[a-z]+$', '1111111'): print('1 match')
if re.match('^[0-9]+[a-z]+$', '1111111a'): print('2 match')
if re.match('^[0-9]+[a-z]+$', '1111111A'): print('3 match')


2 match


if re.match('bus{2}an', 'buan'): print('1 match')
if re.match('[bus{2}an]', 'buan'): print('2 match')
if re.match('bus{2}an', 'busan'): print('3 match')
if re.match('bus{2}an', 'bussan'): print('4 match')
if re.match('bus{2}an', 'busssan'): print('5 match')
if re.match('bus{2,3}an', 'bussan'): print('5-1 match')
if re.match('bus{2,3}an', 'busssan'): print('6 match')
if re.match('bus{2,3}an', 'bussssan'): print('7 match')
if re.match('bus?an', 'buan'): print('8 match')
if re.match('bus?an', 'busan'): print('9 match')
if re.match('bus?an', 'bussan'): print('10 match')


2 match
4 match
5-1 match
6 match
8 match
9 match


if re.match('[abc]', 'a'): print('1 match')
if re.match('[abc]', 'edf'): print('2 match')
if re.match('[abc]', 'abd'): print('3 match')
if re.match('[abc]', 'b12'): print('4 match')


1 match
3 match
4 match


if re.match('a|b|c', 'a'): print('1 match')
if re.match('a|b|c', 'b'): print('2 match')
if re.match('a|b|c', 'c'): print('3 match')
if re.match('a|b|c', 'd'): print('4 match')
if re.match('[abc]', 'a'): print('5 match')


1 match
2 match
3 match
5 match


if re.match(r'[\d]', '1'): print('1 match')
if re.match(r'\d', '1'): print('2 match')
if re.match(r'[\d]', 'a'): print('3 match')
if re.match(r'\d', 'a'): print('4 match')
if re.match(r'[\D]', '1'): print('5 match')
if re.match(r'\D', '1'): print('6 match')
if re.match(r'[\D]', 'a'): print('7 match')
if re.match(r'\D', 'a'): print('8 match')
if re.match(r'\D', '가'): print('9 match')
if re.match(r'\d', '가'): print('10 match')


1 match
2 match
7 match
8 match
9 match


if re.match(r'[\s]', ' \t\n\r\f\v'): print('1 match')
if re.match(r'[\S]', ' \t\n\r\f\v'): print('5 match')
if re.match(r'[\S]', 's \t\n\r\f\v'): print('6 match')


1 match
6 match


if re.match(r'[\w]', 'a0'): print('1 match')
if re.match(r'[\W]', 'a0'): print('2 match')
# '가' counts as a word character for str patterns, so \W does not match it
if re.match(r'[\W]', '가'): print('3 match')
if re.match(r'\W', '가'): print('4 match')


1 match

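search finds only one match, findall finds all of them

The regex (123)+ matches the set 123 repeated one or more times
The first result is only the 123 at span (4, 7) → even when several matches exist, search returns just the first one
I expected the third call, findall, to return 123, 123123, 123123, but the result is 123, 123, 123: when the pattern contains a capturing group, findall returns the text captured by the group instead of the whole match, and a repeated group keeps only its last repetition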


# '123+' would repeat only the 3; 1 and 2 appear once
# p = re.compile( '123+' )
# '(123)+' repeats the whole set 1, 2, 3
p = re.compile( '(123)+' )
print( p.search('312 123 123123 123123132 GoodJob?'))
print( p.search('312 123123 123 123123132 GoodJob?'))
print( p.findall('312 123 123123 123123132 GoodJob?'))


<_sre.SRE_Match object; span=(4, 7), match='123'>
<_sre.SRE_Match object; span=(4, 10), match='123123'>
['123', '123', '123']
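
If the whole repeated run is what you want from findall, one option is a non-capturing group, (?:...). A minimal sketch on the same input string; the variable names are illustrative.

```python
import re

text = "312 123 123123 123123132 GoodJob?"

# Capturing group: findall returns the last repetition captured by the group.
captured = re.findall(r"(123)+", text)

# Non-capturing group (?:...): findall returns each whole match.
whole = re.findall(r"(?:123)+", text)

# finditer works with either form, because group(0) is always the whole match.
via_iter = [m.group(0) for m in re.finditer(r"(123)+", text)]

print(captured, whole, via_iter, sep="\n")
```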

Let's build a pattern object that matches a string like the one below!

kim 010-2222-9090


p = re.compile(r'^\w+\s+\d{3}[-]\d{4}[-]\d{4}$')
m = p.search('kim 010-2222-9000')
m


<_sre.SRE_Match object; span=(0, 17), match='kim 010-2222-9000'>

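Grouping

group(index)	description
group(0)	the entire matched string
group(1)	the first group
group(2)	the second group
group(n)	the n-th group

To separate the name from the phone number, the match above would have to be processed as a string again
Grouping makes the parts easy to pull apart

To create a group, simply wrap part of the regex in ()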


p = re.compile(r'^(\w+)\s+((\d{3})[-]\d{3,4}[-]\d{4})$')
m = p.search('kim 010-2222-9000')
print( m.group(),'\n', m.groups(),'\n',m.group(1),'\n',m.group(2),'\n',m.group(3))


kim 010-2222-9000 
 ('kim', '010-2222-9000', '010') 
 kim 
 010-2222-9000 
 010

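Reusing groups (backreferences): \1, \g<1>, \g<2>

A group can be referenced by its index or by a name assigned to it
In the code below I expected the match to be pandas numpy,
but it was numpy numpy → \1 matches the exact text that group 1 captured, so the same word has to appear twice in a row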


p = re.compile(r'(\b\w+)\s+\1') # \1: matches the same text that group 1 captured
m = p.search('pandas numpy numpy scikit')
m.group()


'numpy numpy'
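
A common use of this backreference is collapsing accidentally repeated words. The following is a sketch under the assumption that duplicates are separated only by whitespace; the sample text is made up.

```python
import re

text = "pandas numpy numpy scikit scikit learn"

# \b(\w+) captures a word; (?:\s+\1\b)+ matches one or more immediate repeats of that exact word.
dup = re.compile(r"\b(\w+)(?:\s+\1\b)+")

# Replace each run of repeats with a single copy of the captured word.
deduped = dup.sub(r"\1", text)
print(deduped)
```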


# Masking resident registration numbers
data = '''
MBC 990101-1234567
KBS 990101-2345678
'''
p = re.compile(r'^(\w+)\s+(\d{6})[-]([1-4])\d{6}$', re.MULTILINE)
# \g<1> refers to group 1 inside the replacement string
print(p.sub(r'\g<1> \g<2>-\g<3>******', data))


MBC 990101-1******
KBS 990101-2******

Naming groups

(?P<name>...)


p = re.compile(r'^(?P<username>\w+)\s+((\d{3})[-]\d{3,4}[-]\d{4})$')
m = p.search('kim 010-2222-9000')
print( m.groups(),'\n', 'username : ',m.group('username'))

('kim', '010-2222-9000', '010') 
 username :  kim
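
Named groups also work with groupdict() and inside replacement strings. In the sketch below the group names phone and prefix are additions for illustration; only username appears in the original pattern.

```python
import re

p = re.compile(r"^(?P<username>\w+)\s+(?P<phone>(?P<prefix>\d{3})-\d{3,4}-\d{4})$")
m = p.search("kim 010-2222-9000")

# groupdict() returns every named group as a dictionary.
fields = m.groupdict()

# Named groups can be referenced in a replacement with \g<name>.
masked = p.sub(r"\g<username>: \g<prefix>-****-****", "kim 010-2222-9000")

print(fields)
print(masked)
```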

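findall, finditer

findall returns a list of strings, so the group method available on search or match results cannot be used → iterate with a for loop
finditer returns a callable_iterator; the iterator itself has no group method either,
but each item it yields in a for loop is a match object, so group works on it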


p = re.compile('[a-z]+')
r = p.findall('i am a Good Boy')
for w in r:
    print(w)

i
am
a
ood
oy


p = re.compile('[a-z]+')
r = p.finditer('i am a Good Boy')
for w in r:
    print(type(w),w.group(),w.span())
r.group() # raises AttributeError: group cannot be called on the iterator itself

<class '_sre.SRE_Match'> i (0, 1)
<class '_sre.SRE_Match'> am (2, 4)
<class '_sre.SRE_Match'> a (5, 6)
<class '_sre.SRE_Match'> ood (8, 11)
<class '_sre.SRE_Match'> oy (13, 15)

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-29-eced1f82ca20> in <module>()
      3 for w in r:
      4     print(type(w),w.group(),w.span())
----> 5 r.group()

AttributeError: 'callable_iterator' object has no attribute 'group'

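sub

Removing Hangul
Each Hangul character maps to a numeric code point, and those code points are contiguous, so a range can cover them

ㄱ~ㅎ : consonant jamo, starting at U+3131
ㅏ~ㅣ : vowel jamo
가~힣 : precomposed syllables, U+AC00 to U+D7A3 (2 bytes each in UTF-16 or EUC-KR, 3 bytes in UTF-8)

Removing English letters
English letters also map to contiguous code points, so a range such as [a-zA-Z] covers them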


p = re.compile('(blue|white|red)')
print(p.sub('color','blue is a hat and white hand'))

color is a hat and color hand


p = re.compile('(blue|white|red)')
print(p.sub(' ','blue is a hat and white hand'))

  is a hat and   hand


p = re.compile(r'[ㄱ-ㅣ가-힣";:\[\]().,]+')  # Hangul plus punctuation marks
p1 = re.compile('[^ㄱ-ㅣ가-힣]+')            # everything except Hangul
txt = r'''
한글; The English Wikipedia\ is the English-language edition 
of the free online encyclopedia Wikipedia. 
'''
m = p.sub('', txt)
r = p1.sub('', txt)
print(m,'\n',r)

 The English Wikipedia\ is the English-language edition 
of the free online encyclopedia Wikipedia 
 
 한글
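
The removal tasks listed at the top (Hangul, Latin letters, digits, whitespace) can be combined into one small helper. This is a sketch: strip_chars and the constant names are hypothetical, and the character ranges are the ones discussed in this section.

```python
import re

# Hypothetical names; each pattern covers one class of characters to remove.
HANGUL = re.compile(r"[ㄱ-ㅣ가-힣]+")
LATIN = re.compile(r"[a-zA-Z]+")
DIGITS = re.compile(r"\d+")
SPACES = re.compile(r"\s+")

def strip_chars(text, *patterns):
    """Remove every match of each pattern, then collapse leftover whitespace."""
    for p in patterns:
        text = p.sub("", text)
    return SPACES.sub(" ", text).strip()

sample = "부산 Busan 2018 맛집 best 10"
no_hangul = strip_chars(sample, HANGUL)
hangul_only = strip_chars(sample, LATIN, DIGITS)
print(no_hangul)
print(hangul_only)
```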


p = re.compile(r'\bBusan\b')
p1 = re.compile(r'\sBusan\s')
p2 = re.compile(r'\BBusan\B')
m = p.search('city Busan dickeslc')
r = p1.search('city Busan dickeslc')
a = p2.search('city Busan dickeslc')  # None: \B requires a non-boundary, but 'Busan' sits between spaces
print(r'\b :',[m.group()])
print(r'\s :',[r.group()])
print(r'\B :',a)

\b : ['Busan']
\s : [' Busan ']
\B : None
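
To see \B actually match, the word has to be embedded inside a longer word. A small sketch with made-up text:

```python
import re

text = "Busan myBusanTrip Busan2018"

# \b...\b: 'Busan' must stand alone as a word.
standalone = [m.span() for m in re.finditer(r"\bBusan\b", text)]

# \B...\B: 'Busan' must have word characters on both sides.
embedded = [m.span() for m in re.finditer(r"\BBusan\B", text)]

print(standalone, embedded)
```

'Busan2018' matches neither pattern: it starts at a word boundary but is followed directly by a digit.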

Lookahead (positive/negative)

A lookahead tests what follows the current position without consuming it: (?=...) must match there, (?!...) must not


p = re.compile(r'.+(?=:)')
m = p.search('http://m.naver.com')
print(m.group())

http


# From the file names listed below, exclude those whose extension is py or msi
# and show only the remaining file names
# File-name regex
p = re.compile(r'.*[.](?!py$|msi$).*$')
print(p.search('a.com'))
print(p.search('a.py'))
print(p.search('a.msi'))

<_sre.SRE_Match object; span=(0, 5), match='a.com'>
None
None
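
Both kinds of lookahead can be sketched on small lists. The file names and the price string below are illustrative assumptions; the file-name pattern is the one used above.

```python
import re

# Negative lookahead: keep file names whose extension is neither py nor msi.
files = ["a.com", "a.py", "setup.msi", "notes.txt", "script.pyc"]
not_py_msi = re.compile(r".*[.](?!py$|msi$).*$")
kept = [f for f in files if not_py_msi.match(f)]

# Positive lookahead: take a number only when " USD" follows; the unit itself is not consumed.
amounts = re.findall(r"\d+(?=\s*USD)", "coffee 5 USD, cake 7 USD, 10 coupons")

print(kept)
print(amounts)
```

Because the lookahead is anchored with $, 'script.pyc' is kept: 'py' is followed by 'c', not by the end of the string.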


# Email regex
email = '''
sif12@naver.com
bbbb@google.com
'''
p = re.compile(r'^\w+@.+[.]+.+$', re.MULTILINE)
print( p.search(email) )

<_sre.SRE_Match object; span=(1, 16), match='sif12@naver.com'>
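
search returns only the first address. To collect every address, findall with MULTILINE is one option. The pattern below is a deliberately simple assumption (local part, "@", dotted domain), not a full email validator, and the third sample line is added for illustration.

```python
import re

email = """
sif12@naver.com
bbbb@google.com
not-an-email
"""

# Assumption: simplified address shape; real email syntax is broader than this.
p = re.compile(r"^[\w.+-]+@\w+(?:[.]\w+)+$", re.MULTILINE)

addresses = p.findall(email)
spans = [m.span() for m in p.finditer(email)]

print(addresses)
print(spans)
```

The domain part uses a non-capturing group so that findall returns whole addresses rather than the captured fragment.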
