I am absolutely new to AI/ML and need some guidance/direction.

Every “New to AI, try this” guide I find either goes down a path that isn’t right for the project I’m working on, or is so convoluted, with so many terms I need to look up, that I get rather frustrated. Maybe I’m too old to learn/use AI? Anyway . . .

This is my project, and any guidance, pointers, or help would be super appreciated. I’m working on a job aggregator. I have a simple web crawler that goes to a URL, fetches the HTML, cleans a lot of the text and structure, and outputs the content of the job posting.

I then go in manually, look at that simplified HTML and extract the actual job description (vs Company description, benefits, other stuff on a job posting) to be used in another database. I use the exact wording, straight copy and paste, no summarization or interpretation.

I have about 400 data points in a database that I’ve manually extracted. They look like this:

job_site: “COMPANY_NAME”
raw_html: “<h1>Job Title</h1><p>This is what we do</p><p>We are looking for someone who</p>”
job_description: “We are looking for someone who”

I feel like I can use that as training data to do some form of text . . . extraction ?? . . . from an HTML document. But I don’t have any clue where to start.
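One possible first step, sketched here as an assumption rather than a recommendation from the thread: turn each record into a "span" example by locating the copied description inside the raw HTML. This start/end-offset shape is what extractive question-answering models are commonly trained on. The `record` dict and the `to_span_example` helper are hypothetical names; only the three field names come from the example above.

```python
# Hypothetical record shaped like the example data point above.
record = {
    "job_site": "COMPANY_NAME",
    "raw_html": "<h1>Job Title</h1><p>This is what we do</p>"
                "<p>We are looking for someone who</p>",
    "job_description": "We are looking for someone who",
}


def to_span_example(rec):
    """Return character offsets of the description inside raw_html.

    Returns None when the description is not a verbatim substring,
    e.g. when it spans several tags in the HTML.
    """
    start = rec["raw_html"].find(rec["job_description"])
    if start == -1:
        return None
    end = start + len(rec["job_description"])
    return {
        "context": rec["raw_html"],
        "answer": rec["job_description"],
        "start": start,
        "end": end,
    }


example = to_span_example(record)
# The slice defined by the offsets reproduces the copied description.
print(example["context"][example["start"]:example["end"]])
```

Because the descriptions were copied word for word, a plain substring search works for single-paragraph cases. A description that covers several paragraphs will not match the tagged HTML directly, so those records would need the tags stripped first.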

  • namnnumbr
    1 year ago

    Look into Beautiful Soup (bs4) for parsing HTML web pages. Once you have that cleaned, you should be able to do any kind of NLP modeling over it.
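As a rough illustration of the "parse, then model" idea, here is a minimal sketch that splits a posting into text blocks and labels each block by whether it appears in the hand-extracted description. It uses the standard-library `html.parser` as a dependency-free stand-in for Beautiful Soup, not the bs4 API itself. `BLOCK_TAGS`, `BlockExtractor`, and `html_to_blocks` are hypothetical names, and the choice of block tags is an assumption about how the cleaned HTML is structured.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit the text blocks in the cleaned HTML.
BLOCK_TAGS = {"h1", "h2", "h3", "p", "li", "div"}


class BlockExtractor(HTMLParser):
    """Collect the text of each block-level element as one string."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        # Join buffered text, collapse whitespace, keep non-empty blocks.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)


def html_to_blocks(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text outside a block tag
    return parser.blocks


raw_html = ("<h1>Job Title</h1><p>This is what we do</p>"
            "<p>We are looking for someone who</p>")
job_description = "We are looking for someone who"

blocks = html_to_blocks(raw_html)
# 1 = block is part of the hand-extracted description, 0 = it is not.
labels = [int(block in job_description) for block in blocks]
print(list(zip(blocks, labels)))
```

With roughly 400 postings labeled this way, each block becomes one training example for an ordinary text classifier ("job description" vs. "everything else"), which is a simpler problem than generating text.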

    Both Hugging Face and spaCy have pretty good tutorials/walkthroughs for tokenizing your data, doing entity extraction, etc.

    That said, it would be helpful for you to figure out what you want to do with your project. For example, you could try to identify keywords relevant to job titles, you could try to use similarity metrics to recommend new roles based on the current one, etc.