
Python BeautifulSoup - Scraping Div Spans And P Tags - Also How To Get Exact Match On Div Name

I have two divs I am trying to scrape, with the same name (but there are other divs on the page with a partial name match that I don't want). The first I just need the text i

Solution 1:

Your main question is how to extract the text that sits directly inside a <p>, outside any <span>.

In BeautifulSoup, a NavigableString corresponds to a bit of text within a tag. So you can extract that text by keeping only the children of each <p> that are instances of NavigableString.

from bs4 import BeautifulSoup, NavigableString
html = "your example"  # placeholder: the HTML from the question

soup = BeautifulSoup(html, "lxml")
# Iterating over a Tag yields its direct children (strings and nested tags)
for e in soup.find("p"):
    print(e, type(e))
# Name:  <class 'bs4.element.NavigableString'>
# <strong><span itemprop="name">Alisson Ramses Becker</span></strong> <class 'bs4.element.Tag'>

To collect the bare text from every <p>:

resultset = soup.find_all("p")
maintext = []
for result in resultset:
    for element in result:
        if isinstance(element, NavigableString):
            maintext.append(element)

print(maintext)
# ['Name: ', 'Date of birth:', 'Place of birth:', 'Club: ', 'Squad: 13', 'Position: Goal Keeper']

This is equivalent to the list comprehension:

[element for result in resultset for element in result if isinstance(element, NavigableString)]

My full test code

from bs4 import BeautifulSoup,NavigableString
html = """

    <div class="row-table details -bp30">
        <div class="col">
            <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>                <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>                <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>               
                        </div>
        <div class="col">
            <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>                <p>Position: Goal Keeper</p>
        </div>
    </div>
"""
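soup = BeautifulSoup(html, "lxml")
resultset = soup.find_all("p")
# Bare text of each <p> (the labels)
fr = [element for result in resultset for element in result if isinstance(element, NavigableString)]
# Text of every <span> that has an itemprop attribute (the values)
spanset = [e.text for e in soup.find_all("span", {"itemprop": True})]
# Pair labels and values in document order; this only works here because
# the paragraphs without a <span> come last
setA = ["".join(z) for z in zip(fr, spanset)]
# Append the remaining labels that have no <span>
final = setA + fr[len(spanset):]
print(final)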

Output

['Name: Alisson Ramses Becker', 'Date of birth:02/10/1992', 'Place of birth: Brazil', 'Club: Liverpool', 'Squad: 13', 'Position: Goal Keeper']
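The zip-based pairing above depends on every paragraph without a <span> coming after the ones that have one. As a minimal sketch of an alternative that does not rely on that ordering, each <p> can be read on its own with get_text() and split at the first colon. The dictionary name `details` and the use of the built-in "html.parser" are choices made for this illustration, not part of the original answer, and the sketch assumes every paragraph has the form "label: value".

```python
from bs4 import BeautifulSoup

html = """
<div class="col">
  <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>
  <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
  <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
  <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>
  <p>Position: Goal Keeper</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

details = {}
for p in soup.find_all("p"):
    # get_text() joins every string inside the <p>, including nested tags
    label, _, value = p.get_text().partition(":")
    # Assumption: the first colon separates the label from the value
    details[label.strip()] = value.strip()

print(details)
```

Because each paragraph is handled independently, a <p> with or without a <span> is treated the same way.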

Solution 2:

Assuming you have the right to scrape this site and there is no API or JSON endpoint to use instead, one slow way to do it is:

from bs4 import BeautifulSoup as bs

html = '''
 <div class="row-table details -bp30">
        <div class="col">
            <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>                <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>                <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>               
                        </div>
        <div class="col">
            <p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p>                <p>Position: Goal Keeper</p>
        </div>
    </div>
'''

soup = bs(html,'html5lib')

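# One list of <p> tags per <div class="col">
data = [d.find_all('p') for d in soup.find_all('div', {'class': 'col'})]

# Flatten the nested lists and keep the full text of each <p>
value = []
for i in data:
    for j in i:
        value.append(j.text)

print(value)
# ['Name: Alisson Ramses Becker', 'Date of birth:02/10/1992', 'Place of birth: Brazil', 'Club: Liverpool', 'Squad: 13', 'Position: Goal Keeper']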
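The question also asks for an exact match on the div name. Searching with a single class such as {'class': 'col'} matches any div whose class list contains that class, so divs that carry additional classes are matched too. One way to require the complete class list, sketched here under assumptions, is to pass a function to find_all(). The second div in the sample HTML, the helper name `has_exact_classes`, and the constant `WANTED` are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the second div shares two classes with the first
html = """
<div class="row-table details -bp30"><p>Squad: 13</p></div>
<div class="row-table details"><p>Unrelated</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# The full class list the target div must have, in this order
WANTED = ["row-table", "details", "-bp30"]

def has_exact_classes(tag):
    # BeautifulSoup exposes the class attribute as a list of strings
    return tag.name == "div" and tag.get("class") == WANTED

exact = soup.find_all(has_exact_classes)        # only the first div
loose = soup.find_all("div", class_="details")  # both divs

print(len(exact), len(loose))
```

Comparing lists makes the match sensitive to class order, which is usually what "exact" means when the markup is generated from a template; compare sets instead if the order can vary.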
