Hello - I'm Mannie, a junior coding craftsman

(Mannie Gill) #1

Sorry, I should have done this before, but I'm just popping in to introduce myself.

My name is Mannie and I have decided to make a career change from Windows system admin to coding full time :blush:

I live near West Bromwich, but I’m a Manchester United fan; not sure how that worked out.

I have a pet project that I have worked on for the last 2 years: www.jobveta.co.uk

Hope to meet some of you at the local meetups :slight_smile:

(Stuart Langridge) #2

Welcome to the funhouse!

(Andy Wootton) #3

Hi Mannie,

I was interested in your use of “craftsman”. I’ve been worrying about how software people should see themselves for a long time. I chose a career in data processing, studied computer science, became a programmer, had ambitions to be a software engineer etc. This guy took it a step further. I don’t quite agree with his conclusion but I found his analysis interesting http://www.codeproject.com/Articles/1130886/The-Software-Development-Process-Science-Engineeri

(Jon) #4

Welcome, Mannie.

That’s a nice-looking job website you have there. However, I tried to find an actual job on it and couldn’t find a single one - is it populated? I am working on a project in this space as well (though only on a spare-time basis at present).

I was a bit confused by the “how it works” - the images are from a site called “Zapier”, and don’t seem to match the tasks described?

If it’s not secret sauce, what sites are you scraping from? Do you run your own robots?

(Mannie Gill) #5

Thanks for the warm welcome guys :slight_smile:

@Woo That’s an interesting article indeed, but from my limited experience I think it comes down to how we design the implementation of our code, especially with regard to OOP and the patterns we apply. I’m also intrigued by how different developers approach the same problem. The word “craftsman” fits how I see my code at the moment (that may change in time), but writing code that is extendable and follows SOLID is a skill indeed.

@halfer Ah yes, that would be my fault for leaving the crawlers disabled after updating the server. I have enabled them now, so jobs are being spidered and indexed: http://www.jobveta.co.uk/jobs?q=software+developer&l=

The site is still in development mode at the moment; the Zapier images are just placeholders I found online.

As regards the scraping: I use C# with a premium scraper API, Redis to store crawled URLs, and MySQL for the core data.
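Redis is a natural fit for this kind of crawled-URL bookkeeping because its SADD command is atomic and reports whether the member was newly added, so the crawler can skip URLs it has already seen. The sketch below is purely illustrative (it is not Mannie's code); a plain in-memory set stands in for Redis so it runs without a server:

```python
# Hypothetical sketch of crawled-URL deduplication. The class mimics
# Redis SADD semantics: add() returns True only for previously unseen URLs.

class SeenUrls:
    def __init__(self):
        self._seen = set()

    def add(self, url: str) -> bool:
        # With real Redis this would be: redis.sadd("crawled", url) == 1
        if url in self._seen:
            return False
        self._seen.add(url)
        return True

seen = SeenUrls()
queue = [
    "http://example.com/jobs/1",
    "http://example.com/jobs/2",
    "http://example.com/jobs/1",  # duplicate: should be filtered out
]
to_crawl = [u for u in queue if seen.add(u)]
print(to_crawl)  # only the two distinct URLs survive
```

Swapping the set for a real Redis connection keeps the same logic but lets multiple crawler processes share one "seen" set.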

The biggest issue I have had with crawling such a large amount of “unclean” data is duplicated jobs (which needs similarity checking) and duplicated city/town names, e.g. Birmingham in the UK versus its US equivalent. For the latter I had to use LibPostal, which has improved the accuracy of the indexed data.
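One simple way to do the similarity checking mentioned above (a generic sketch, not necessarily how jobveta does it) is to normalise the fields and apply a fuzzy ratio, so two ads that differ only in casing or spacing are caught as duplicates:

```python
# Hedged sketch of duplicate-job detection using Python's stdlib fuzzy
# matcher. The field names ("title", "company") are assumptions for
# illustration. Company is compared exactly after normalisation; the
# title is compared fuzzily with a tunable threshold.
from difflib import SequenceMatcher

def normalise(text: str) -> str:
    # Lower-case and collapse whitespace so trivial differences don't count.
    return " ".join(text.lower().split())

def is_duplicate(job_a: dict, job_b: dict, threshold: float = 0.9) -> bool:
    if normalise(job_a["company"]) != normalise(job_b["company"]):
        return False
    ratio = SequenceMatcher(
        None, normalise(job_a["title"]), normalise(job_b["title"])
    ).ratio()
    return ratio >= threshold

a = {"title": "Software Developer (C#)", "company": "Acme Ltd"}
b = {"title": "Software  developer (C#)", "company": "ACME LTD"}
c = {"title": "Data Analyst", "company": "Acme Ltd"}
print(is_duplicate(a, b))  # True: near-identical ads
print(is_duplicate(a, c))  # False: different role
```

In practice you would compare more fields (location, salary, posting date) and probably bucket ads by company first to avoid an O(n²) all-pairs comparison.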

Are you looking to crawl data?

(Jon) #6

They’re back! Thanks.

Yes, that’s the basis of my project. The data on the prototype is stale as I’ve disabled my crawlers too whilst I am working on some stability fixes, but it’ll give you a general idea. My UI isn’t great, but it’s just a proof of concept.

“I use … a premium scraper API”

Yeah, I was thinking of going down that route, e.g. Import.io, but it felt like a risk that a scraping partner could change T&Cs, prices, etc - so I built my own. It’s 90% there, but it isn’t quite production ready yet!

How often do you get to work on your site? I try to touch it a few times a week, but I often get distracted on other side projects! :smiley:

(Mannie Gill) #7

I joined the Makers Academy bootcamp in July, so I did not have any time for three months. I will hopefully launch the site in beta in about a month. Got some bugs to fix first :smiling_imp:

What language did you build the crawlers in? And how have you found crawling different sites daily, with site design/structure changes breaking crawlers? At the moment this means a lot of manual work for me fixing broken crawlers, even with the best XPath patterns.

(Jon) #8

Good questions! The system is mostly PHP - I just used what I am most comfortable with. The library ecosystem for PHP is great, but I’m not doing anything that Ruby/Python/Node couldn’t do.

This is the biggest challenge, but I think it is surmountable. I’m using a lightweight headless browser and a scraping engine that allows me to build scrapers using a web interface. It allows CSS and XPath expressions, plus finding links, clicking them, simple loop variables, etc. It is fairly stable across ~50 small employers - people don’t often change their sites, and for the ones that hand-write their broken links, there comes a point where you just have to drop them as a scrape target.

The best tip I have so far is to make your crawlers easily and quickly editable, so you can keep them up to date.

I have some time off coming up, so maybe I will get a chance to do some more work on it. Good luck with your bugs! :crab:

(Marc Cooper) #9

Greetings, @mannieg. Any ManU fan is most welcome :slight_smile:

On the scraping front: I recently had a play with Elixir’s scrape library, which can rip through a lot of stuff by exploiting concurrency. I haven’t hooked that up with PostgreSQL 9.6’s new phrase search yet, though. Fun stuff to be had there, I suspect.
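For anyone who hasn't seen it: phrase search arrived in PostgreSQL 9.6 via the `<->` (followed-by) tsquery operator and the `phraseto_tsquery()` helper, which match terms only when they appear adjacently. A sketch against a hypothetical jobs table (table and column names are invented):

```sql
-- Match "software developer" as an adjacent phrase, not just both words
-- anywhere in the title.
SELECT id, title
FROM jobs
WHERE to_tsvector('english', title) @@ phraseto_tsquery('english', 'software developer');

-- Equivalent explicit form using the followed-by operator:
SELECT id, title
FROM jobs
WHERE to_tsvector('english', title) @@ to_tsquery('english', 'software <-> developer');
```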