Code is *hard* Sep. 17th, 2003 09:45 pm

Just realised the caching strategy for the lj-userinfo XMLRPC interface is going to be a bugger; essentially "known data" will be a (probably) sparsely-populated matrix, with elements expiring (fairly) independently.

Since I can't just fetch exactly the user/value sets I'm missing (if I want to wrap multiple data fetches up in one call), I need to construct a minimal rectangular superset of the data... probably by building a 'need' matrix, matching it against a 'gotcached' matrix, and marking whole rows/columns as 'dirty' - up to a maximum fetch size.
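Roughly what I have in mind, as a quick Python sketch - the names (wanted/cached/MAX_CELLS) and the crude trimming rule are invented here just to pin the idea down, nothing to do with the real interface:

# Sketch only: users are rows, userinfo fields are columns.
# 'cached' maps (user, field) -> value for everything still fresh;
# 'wanted' is the set of (user, field) pairs the caller asked for.

MAX_CELLS = 200  # invented cap on how big one aggregated fetch may get

def plan_fetch(wanted, cached):
    """Return (users, fields) for one rectangular fetch covering the misses."""
    missing = [(u, f) for (u, f) in wanted if (u, f) not in cached]
    if not missing:
        return None

    # Mark whole rows/columns dirty: any user or field with a miss is included.
    dirty_users = {u for (u, f) in missing}
    dirty_fields = {f for (u, f) in missing}

    # Trim back down if the rectangle blows past the fetch-size limit,
    # dropping the users with the fewest misses first (crude, but bounded).
    while len(dirty_users) * len(dirty_fields) > MAX_CELLS and len(dirty_users) > 1:
        least = min(dirty_users,
                    key=lambda u: sum(1 for (mu, mf) in missing if mu == u))
        dirty_users.discard(least)

    return sorted(dirty_users), sorted(dirty_fields)

Anything the fetch brings back that wasn't strictly needed just goes into the cache anyway, and any users dropped by the trimming get picked up by a later fetch.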

Problem is, this is susceptible to a simple pathological case where one user is missing all data and each user is missing one piece of data: the fully-missing user's row is entirely dirty, so every column gets marked dirty; and since every row then contains at least one dirty cell, every row gets marked dirty too - so the "minimal" superset turns out to be the whole matrix.

I think this can be tackled by ensuring that data is only ever aggregated on a like basis; if you want one person's life story, and each of their friends' journal types, do 2 fetches!
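In other words, group the misses by which set of fields each user still needs, and do one fetch per distinct set - another rough Python sketch with made-up names:

from collections import defaultdict

def plan_like_fetches(missing):
    """Group missing (user, field) pairs into one fetch per distinct field set."""
    fields_by_user = defaultdict(set)
    for user, field in missing:
        fields_by_user[user].add(field)

    # One batch per distinct set of wanted fields, so no cell is ever
    # fetched that isn't actually needed.
    batches = defaultdict(list)
    for user, fields in fields_by_user.items():
        batches[frozenset(fields)].append(user)

    return [(sorted(users), sorted(fields)) for fields, users in batches.items()]

The life-story / journal-types example then falls out naturally as exactly two fetches.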

Nuts - and I need to store refusals to serve, too - or I'll end up trying to refetch refused data every time...
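Something like a negative cache entry, i.e. (pure sketch; the TTLs are invented):

import time

REFUSED = object()       # sentinel meaning "the server refused to give us this"
REFUSAL_TTL = 15 * 60    # invented: don't re-ask for refused data for 15 minutes
VALUE_TTL = 60 * 60      # invented: ordinary values stay fresh for an hour

def record_result(cache, user, field, value=None, refused=False):
    ttl = REFUSAL_TTL if refused else VALUE_TTL
    cache[(user, field)] = (REFUSED if refused else value, time.time() + ttl)

def is_known(cache, user, field):
    """True if we hold a fresh value *or* a fresh refusal for this cell."""
    entry = cache.get((user, field))
    return entry is not None and entry[1] > time.time()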

Hrm, better make a comment on that last one to ciphergoth's latest lj_dev post tomorrow...

From: ciphergoth
Date: September 18th, 2003 - 12:43 am
Yes, please do post about it!

Been thinking about this; I think that for most applications the right strategy is to do one fetch for every distinct set of wanted fields. That way you never fetch more than you need.
From: wechsler
Date: September 18th, 2003 - 03:13 am
But if you have (as is pathologically possible, especially with cached data taken into account) a different set of fields for each/most user(s), you're going to do a whole load of (possibly very similar) fetches rather than one or two "slightly wasteful" ones.

I'm not sure either way is right, BTW, just trying to weigh the options.
From: ciphergoth
Date: September 18th, 2003 - 03:42 pm
A general solution would, I think, be hard; I wouldn't be surprised if there's an NP-complete problem hiding in there. I think the set of field-sets that most applications might find themselves wanting will be pretty small, the per-fetch overhead isn't obscene, and the number of fetches is bounded by the number of users with unknown data. If you come up with an interesting solution I'd be fascinated to see it, but I think the relatively naive approach I've outlined will usually work close to optimally.

Hmm, how about something like this? Set an "inefficiency bound" that determines the maximum number of wasted results in a given row/column you can tolerate. Consider the query which requests all "active" fields for all "active" users. If it contains a bad row/column (one over the bound), discard a row/column that gets rid of a maximal number of wasted results. Repeat until your query is within the bound, then send it off. Repeat the whole process until there are no unanswered queries.
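Very roughly, in Python - everything here (the names, the tie-break, the fallback when pruning eats the whole query) is invented just to make the greedy step concrete:

def plan_bounded_queries(missing, bound):
    """Greedy sketch of the 'inefficiency bound' idea: trim rows/columns
    whose wasted (already-known) cells exceed the bound, send, repeat."""
    queries = []
    unanswered = set(missing)          # (user, field) pairs still wanted
    while unanswered:
        users = {u for (u, f) in unanswered}
        fields = {f for (u, f) in unanswered}

        def wasted_in_row(u):
            return sum(1 for f in fields if (u, f) not in unanswered)

        def wasted_in_col(f):
            return sum(1 for u in users if (u, f) not in unanswered)

        # Discard whichever row or column sheds the most wasted results,
        # until every remaining row and column is within the bound.
        while users and fields:
            worst_u = max(users, key=wasted_in_row)
            worst_f = max(fields, key=wasted_in_col)
            if wasted_in_row(worst_u) <= bound and wasted_in_col(worst_f) <= bound:
                break
            if wasted_in_row(worst_u) >= wasted_in_col(worst_f):
                users.discard(worst_u)
            else:
                fields.discard(worst_f)

        covered = {(u, f) for u in users for f in fields} & unanswered
        if not covered:
            # Degenerate case: pruning removed everything useful, so just
            # fetch one user's outstanding fields to guarantee progress.
            u = next(iter(unanswered))[0]
            fields = {f for (u2, f) in unanswered if u2 == u}
            users = {u}
            covered = {(u, f) for f in fields}

        queries.append((sorted(users), sorted(fields)))
        unanswered -= covered
    return queries

Cells thrown away by the pruning just come round again on the next pass, since they're still unanswered.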

I can see pathological cases for this too. Still thinking...