python - BeautifulSoup to extract URLs (same URL repeating)


I've tried using BeautifulSoup and regex to extract URLs from a web page. This is my code:

ref_pattern = re.compile('<td width="200"><a href="(.*?)" target=')
ref_data = ref_pattern.search(web_page)
if ref_data:
    ref_data.group(1)
data = [item for item in csv.reader(output_file)]
new_column1 = ["reference", ref_data.group(1)]
new_data = []
for i, item in enumerate(data):
    try:
        item.append(new_column1[i])
    except IndexError, e:
        item.append(ref_data.group(1)).next()
    new_data.append(item)

Though the page has many URLs in it, the code repeats the first URL. I know there's something wrong with

except IndexError, e:
    item.append(ref_data.group(1)).next()

this part, because if I remove it, it gives me only the first URL (without repetition). Please help me extract all the URLs and write them to a CSV file. Thank you.

Although it's not entirely clear what you're looking for, based on what you've stated, if there are specific elements (classes, IDs or text, for instance) associated with the links you're attempting to extract, you can do the following:

from bs4 import BeautifulSoup

string = """\
        <a href="http://example.com">linked text</a>
        <a href="http://example.com/link" class="pooper">linked text</a>
        <a href="http://example.com/page" class="pooper">image</a>
        <a href="http://anotherexmpaple.com/page">phone number</a>"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(string, "html.parser")

# Keep only anchors with class "pooper", an href attribute, and the given text.
for link in soup.find_all('a', {"class": "pooper"}, href=True, text='linked text'):
    print(link['href'])

As you can see, this uses bs4's attribute filter to select only the anchor tags that include the "pooper" class (class="pooper"), and then narrows the results further by passing a text argument (linked text rather than image).

Based on the feedback below, try the following code and let me know how it goes.

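# soup is the BeautifulSoup object built from your page.
for cell in soup.select('td[width="200"]'):
    # Search inside each matching cell and loop over the anchors that are found.
    for link in cell.find_all('a', {"target": "_blank"}, href=True):
        print(link['href'])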
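As for the repetition itself: search() returns only the first match, so ref_data.group(1) is the same URL on every row. If you prefer to keep the regex from the question, findall() returns every captured href instead. Below is a minimal sketch in Python 3 using only the standard library. The sample markup, the function names extract_urls and write_urls, and the single "reference" column layout are assumptions for illustration, not your real page or file.

```python
import csv
import io
import re

# Same pattern as in the question. findall() returns every captured href,
# whereas search() stops at the first match.
REF_PATTERN = re.compile(r'<td width="200"><a href="(.*?)" target=')


def extract_urls(web_page):
    """Return all hrefs matched by REF_PATTERN, in page order."""
    return REF_PATTERN.findall(web_page)


def write_urls(urls, out_file):
    """Write a 'reference' header followed by one URL per row."""
    writer = csv.writer(out_file)
    writer.writerow(["reference"])
    for url in urls:
        writer.writerow([url])


# Hypothetical sample markup standing in for the real page.
sample_page = (
    '<tr><td width="200"><a href="http://example.com/a" target="_blank">A</a></td></tr>'
    '<tr><td width="200"><a href="http://example.com/b" target="_blank">B</a></td></tr>'
)

urls = extract_urls(sample_page)

# In-memory stand-in for a real file; for a file on disk you would pass
# open("urls.csv", "w", newline="") instead (the file name is hypothetical).
buffer = io.StringIO()
write_urls(urls, buffer)
print(buffer.getvalue())
```

write_urls() accepts any list of URLs, so the hrefs collected with BeautifulSoup in the loop above could be appended to a list and passed to it in the same way.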
