birmingham.io

Getting text from Twitter and making it accessible to a Python app

Go with python-twitter - it works, move on :slight_smile:

You certainly can use a queue of your choice; that’s a pretty standard way to do this (you are literally queueing up the tweets for display). Me, personally, I’d write each tweet as a separate file in the directory, with filenames that increase alphabetically (you could do this by time, or (I think) use the tweet IDs themselves). That way you don’t need to care about file locking, and maintaining your place in the “queue” is easy; if babble crashes, just start again from the first file.

That’s an even better idea - I guess I’d just dump the contents of the json into the file, allowing for even less faffing about.
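A minimal sketch of the file-per-tweet idea, for anyone following along. It assumes each tweet is a dict with a numeric `id` and that IDs increase over time; the `tweets` directory and the function names are made up for illustration.

```python
import json
import os
from pathlib import Path

QUEUE_DIR = Path("tweets")  # hypothetical directory name


def write_tweet(tweet: dict, queue_dir: Path = QUEUE_DIR) -> Path:
    """Dump one tweet's JSON into its own file, named so files sort oldest first."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    # Zero-pad the ID so alphabetical order matches numeric order.
    final = queue_dir / f"{int(tweet['id']):020d}.json"
    tmp = queue_dir / (final.name + ".tmp")
    tmp.write_text(json.dumps(tweet), encoding="utf-8")
    os.replace(tmp, final)  # rename into place so the reader never sees a half-written file
    return final


def read_next(queue_dir: Path = QUEUE_DIR):
    """Return (path, tweet) for the oldest queued tweet, or None if there is none."""
    files = sorted(queue_dir.glob("*.json"))  # "*.json" skips the ".tmp" files
    if not files:
        return None
    path = files[0]
    return path, json.loads(path.read_text(encoding="utf-8"))
```

The display side would call `read_next()`, show the tweet, then `path.unlink()` it. If it crashes before the unlink, the same file is simply first in line again on restart.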

+1 for that approach. Maybe add a timestamp to the filenames, just to be safe.

In theory queuing tools are simple, but inevitably the complication creeps in. Processes need to be kept running, so you add supervisord. Security needs to be configured, endpoints need to be named correctly and gems/npms need to be installed, so you end up with config files for each environment and ansible files for deploying everything you need. Then you write bash scripts querying the queue and diagnosing issues, purging the queue, sending a test message etc.

OTOH, files and folders just work. The ‘service’ (i.e. the OS) is always running, API support is mature and well documented in every language, it preserves state during shutdown really well, and you get a bunch of diagnostic tools for free (like ‘ls’ and ‘cat’).

I also like the idea that you can have multiple folders to represent status, even if only a backlog and a processed - should we want to keep a record of what we’ve already displayed. I’m not sure if there’s much need for that, but it’s an option.
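Something like this would cover the backlog/processed split. It's only a sketch: the folder layout and function names are assumptions, and it relies on both folders living on the same filesystem so the move is a plain rename.

```python
import os
from pathlib import Path


def mark_processed(tweet_file: Path, processed_dir: Path) -> Path:
    """Move a displayed tweet file out of the backlog into the processed folder."""
    processed_dir.mkdir(parents=True, exist_ok=True)
    target = processed_dir / tweet_file.name
    os.replace(tweet_file, target)  # a rename, so the file is in exactly one folder at a time
    return target


def backlog(backlog_dir: Path) -> list[Path]:
    """List the files still waiting to be displayed, oldest name first."""
    return sorted(p for p in backlog_dir.iterdir() if p.is_file())
```

Whatever is left in the backlog folder is what still needs displaying, and `ls processed` gives the record of what has already been shown.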

This is interesting. Where folk are seeing simplicity in files and folders, I’m seeing unnecessary complexity. Where folk are seeing complexity in queues, I’m seeing simplicity. (I speak as someone who cut his teeth on COBOL file-based batch systems.)

It’s true that files and folders just work, and so do queues.

That’s just what I was thinking. We should definitely meet up @auxbuss :slight_smile:

I think this is interesting too. I started with card-image files and all my programming experience was on VMS which had a record-based file-system. The thing that put me off my Unix pipes idea (implemented as hidden files, I think?) was the Unix bit-stream not having automatic record structuring, like queues do. Are we seeing different things as ‘the natural way to do things’? I’m having a similar problem getting my head around the cultural differences of functional programming.

I don’t like the idea of tree structured status because that only allows binary choices. I’ve just seen a relationship between flags and hash-tags for the first time.

I’ve realised that last bit might not make much sense to someone who never met a dinosaur. VMS system calls loved passing parameters as bitmaps. You could pass 32 flags in a single ‘long-word’. Procedure calls were expensive, compared to Unix, because better :slightly_smiling: . Hash-tags effectively allow you to be in 2 branches of a tree at once, like in Gmail, just as bits could be ‘chunked’ to pass 2, 4, 8… bits at once, as values or masks to choose values, such as system, group, owner, world security, for read, write, execute or delete, in a 16 bit word.
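For anyone who never met the dinosaur, the flag-packing idea looks roughly like this in Python. This is an illustration only: the bit positions and the order of the four categories are my assumptions, not the real VMS layout.

```python
from enum import IntFlag


class Access(IntFlag):
    READ = 1
    WRITE = 2
    EXECUTE = 4
    DELETE = 8


# Four 4-bit groups packed into one 16-bit word (ordering is illustrative).
CATEGORIES = ("system", "owner", "group", "world")


def pack(**grants: Access) -> int:
    """Pack per-category access flags into a single 16-bit integer."""
    word = 0
    for position, name in enumerate(CATEGORIES):
        word |= int(grants.get(name, Access(0))) << (4 * position)
    return word


def allowed(word: int, category: str, wanted: Access) -> bool:
    """Test whether every flag in `wanted` is set for the given category."""
    nibble = (word >> (4 * CATEGORIES.index(category))) & 0xF
    return (nibble & int(wanted)) == int(wanted)
```

So `pack(owner=Access.READ | Access.WRITE, world=Access.READ)` carries several independent yes/no answers in one value, which is the "in two branches at once" property that tags have and a folder tree does not.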

For this specific example, I think the necessary extra effort to run a queue at all (how does it restart on reboot? On crash? Does it keep all queued items when crashing? What if the process hangs but doesn’t crash?) outweighs the minimal benefits you get from it; that’s why I recommended it. For a different example (for instance, if you needed multiple queue consumers) then a queue would be loads better (if you use the filesystem for that then you end up reinventing a queue). But this isn’t really a queue; it’s a one-producer one-consumer FIFO. It doesn’t need any of the cleverness that a queue process provides.

There is effort in creating a queue. A tiny amount of effort — one line in a while loop. And there is also effort in reading and writing a CSV. There’s always effort, of course. In this case, I suggested redis, so that would need to be installed, which is also a tiny bit of effort.

There are going to be two processes whether we use CSVs or a queue, and either could fail, as could the entire machine or its power supply. No-one specified robustness, but a degree can be presumed. So, sure, there are failure cases, but I see no significant difference between the two cases.

So, I’m not disregarding robustness as premature optimisation, but it’s not a priority. If robustness was a priority, we’d likely not use a pi. YMMV :wink:

The implementation I suggested is a queue, but it’s also FIFO. Though, since there is only one producer, it doesn’t really matter.
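To make the Redis version concrete, here is a rough sketch of a FIFO on a Redis list. It assumes `client` is a redis-py `redis.Redis` instance (or anything with the same `rpush` and `blpop` methods); the key name and function names are invented for the example.

```python
import json

QUEUE_KEY = "tweets"  # hypothetical key name


def enqueue(client, tweet: dict) -> None:
    """Producer: append a tweet to the tail of the Redis list."""
    client.rpush(QUEUE_KEY, json.dumps(tweet))


def dequeue(client, timeout: int = 5):
    """Consumer: pop the oldest tweet, waiting up to `timeout` seconds; None if empty."""
    item = client.blpop(QUEUE_KEY, timeout=timeout)
    if item is None:
        return None
    _key, payload = item  # blpop returns a (key, value) pair
    return json.loads(payload)
```

The consumer is then a `while True:` loop around `dequeue(client)` that displays whatever comes back. Pushing on the right and popping from the left is what gives first-in, first-out order.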

One additional point, if loss of the tweet message were acceptable on failure of the machine (we don’t know), then the queue solution could run entirely in memory, thus requiring no disk management or maintenance. It would just work, unless it didn’t!

As you suggest, if we are likely to be doing more stuff with this system, a more flexible solution is preferable, and the queue idea is better.

I’m keen on playing with redis, as I have very little experience with it, and this would provide the perfect opportunity to do so - but I also have some additional requirements (thought up over the past week) which I’d like the solution to take into account.

  • Ideally, I want to avoid showing the same tweet twice, so I’m going to need some way of storing the timestamp/id of the last tweet in the last batch-get from the API.

  • Additionally, I want the ability to power off the machine (to swap the SD card for another one, and play with the UnicornHAT for other projects), and have it resume from where it left off once I power it back on, taking into account (and thus displaying) any tweets which might have been sent while it was turned off.

I’ve not given these too much thought as of yet, but I’d imagine a file-based solution, using a standard file-naming convention, can solve them both - whereas I’m not sure that redis can (although, as discussed, I’m new to redis, so I’m not entirely sure what it can do).
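One file-based way to cover both requirements is to keep the ID of the last tweet fetched in a small state file. A sketch, assuming tweets are dicts with a numeric `id` that increases over time; the file and function names are hypothetical.

```python
import os
from pathlib import Path


def load_last_id(state_file: Path):
    """Return the ID of the last tweet handled, or None on first run."""
    try:
        return int(state_file.read_text(encoding="utf-8").strip())
    except (FileNotFoundError, ValueError):
        return None


def save_last_id(state_file: Path, tweet_id: int) -> None:
    """Record the newest tweet ID, writing to a temp file and renaming it into place."""
    tmp = state_file.with_name(state_file.name + ".tmp")
    tmp.write_text(str(tweet_id), encoding="utf-8")
    os.replace(tmp, state_file)  # the old value stays readable until the new one is complete


def unseen(tweets: list[dict], last_id) -> list[dict]:
    """Keep only tweets newer than last_id, oldest first."""
    fresh = [t for t in tweets if last_id is None or t["id"] > last_id]
    return sorted(fresh, key=lambda t: t["id"])
```

On power-up the fetcher loads the saved ID and asks for anything newer (the Twitter API's `since_id` parameter should do this, if I remember it right), so tweets sent while the machine was off still get picked up and nothing is shown twice.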

As always, feedback welcome.

Redis persists to disk at intervals and has the equivalent of a relational DB’s write-ahead logging (docs here), so it’s not out of the running :slight_smile:

You jest, surely? Don’t write your own queue! There are a bunch of great queueing servers out there. I suspect you didn’t mean that, but instead meant that there’s a while loop which talks to the queueing server – redis or whatever, as you note. So, I think you’re undervaluing the effort required in configuration and daemon management: you need to install your queue manager (not hard), and then configure it to work the way you want. That bit is hard, because you have to understand the queue software and how its config works and what that means: do you want the queue to be preserved if the queue manager crashes? How do you configure your software to do that? To restore the queue after a dirty restart? All of this stuff is possible: none is trivial. Redis is better at this than most because Salvatore works really really hard on this particular aspect, but you still have to understand it quite well to configure it right. That’s why, for example, running the queue purely in memory isn’t a good idea – you don’t just lose the message you’re working on, you lose all messages in the queue on any crash, which is not ideal at all. Basically, I’m of the opinion that using the right tools for the right job is, of course, the right thing to do… but using a very powerful tool to meet a very simple need is often extremely difficult exactly because you have to work out how to configure all the power out of the tool so that it only meets your simple use case!

However. I should note, here, that I have known form with thinking “this project doesn’t need anything complex like Django, I’ll just lash something together” and then six months down the line realising that the project is still going, is more complex now than it was then, and I wish I had the django admin lying around. So there is a reasonable case for using something with way more features than you need now, because you might need them later and adding them later will be really hard. This idea is in direct contravention to the whole “keep it simple, stupid” approach, and deciding where the boundary lies is something that I’m working on getting better at myself!

<\sigh> Clearly.

I’m really not.

I said:

There is no need to turn off redis’ persistence.

See above. I use this approach all the time.

This isn’t over-engineering. A powerful tool is fine when it’s simple to use (and, as in this case, has a negligible overhead). No-one’s suggesting rabbitmq or even AWS SQS.

That also solves the initial problem, both simply and elegantly.

My tag line is: obsessed with simplicity. As I mentioned earlier, I regard the CSV solution as more complex.

My day job is, basically, helping businesses get out of the mess they’ve created. Typically, a business that has become successful without any focus on the quality of their software (by which I chiefly mean maintainability) and that now finds itself painted into a corner (where the limiting factor has become the ability to adapt their software).

After lack of tests, the biggest problem I encounter is home brewed solutions to problems that have grown out of all proportion to the original use and on which the whole system depends. You probably know the rest of the story: no-one knows how deep the rabbit hole goes, and no-one wants to touch it.

On that basis, when developing, I practice a kind of “YAGNI, but you probably will”. If the overhead to provide vastly more flexibility and functionality is very small, then it’s worth doing from the outset.

I should add that we are considering redis here, which I regard as ubiquitous nowadays.

I think we may be talking at cross-purposes here. Let me try and explain what I mean a different way.

If you already know and like redis, sure, definitely use it! But if you’re coming at this from the perspective of “do I need to use a queue? Which queue?” then you don’t know whether to pick redis, or beanstalkd, or rabbitmq, or SQS, or celery, or, or, or. And you don’t know whether redis has persistence enabled by default, or whether beanstalkd does, or whether rabbitmq does… and you don’t know how to find that out without a relatively deep dive into the documentation for each of those systems, which is a lot of work… and you don’t know how to ensure that the queue server starts on boot, and restarts on crash, and restores its data automatically in both of those situations, and how to configure how much memory it uses, and …

As I say: redis does a pretty good job of sensible defaults for most of these things. But if you’re not already a semi-experienced redis user, you don’t know this. And even if you are, you don’t know it about beanstalkd, and so you don’t know how to choose which is best for your current project.

I think that there is a huge amount of sheer decision paralysis which comes from not knowing anything about an area and having to choose between all the different possibilities there. That’s quite a terrifying experience, for most people. I don’t think that using redis is overengineering. But I do think that it’s difficult to get to the stage where you know enough about it to be confident that it’s actually doing what you want, and normally getting to that stage involves getting bitten a bunch of times when you thought it was doing a thing and actually it isn’t. As noted, Salvatore really cares about the default experience and so works hard on this; others don’t anywhere near as much.

I guess we’re miles off topic now :slight_smile: I pretty much agree with what you’ve said, though I believe you overstate the complexity, and I don’t share your view about the complexity of learning redis. It’s a k/v store with tons of extras; learn them as you go. (Geez, beanstalkd, folk I talk to keep bringing that up.)

For sure, knowledge and experience is a bonus. However, @LimeBlast asked a question, and I suggested a solution. I’m more than happy to extend that to knowledge sharing. I genuinely wish more folk in our business asked questions and persisted. If everyone goes through trial and error on each problem, then progress will be excruciatingly slow.

At bottom, I’d prefer (wish?) folk didn’t revert to CSV (or even flat file) solutions as the default “simple” solution. I don’t believe they are, in most cases. (Caveat: CSV solutions are valid, especially for exports for users requiring spreadsheet data, and slow moving, large datasets that can be compressed. And yup, flat files are cool too. I use them :slight_smile: )

That’s fair comment, indeed. :slight_smile: We have different philosophies on how to set up a project, I think, but that doesn’t make either of them wrong, and I suspect this discussion has probably been enlightening to anyone wondering which path to go down – some will go your way and some mine, and we can grab a beer and talk about who was right at some point :slight_smile:

:thumbsup: (meet the minimum character count)

I’m not sure how I tripped over this thread again but since I saw it the first time, I’ve watched this https://www.infoq.com/presentations/Simple-Made-Easy

It contains some interesting discussion on ‘simple’, ‘complex’ and ‘easy’.

You are in a maze of twisty braids…
