Experienced product innovator and former Nokia quality engineer who was directly involved in the launch and support of Linux-powered mobile computers like the N800. 2011 Nokia Developer Champion, three-time maemo.org community council representative and current MeeGo community advocate, working on grassroots marketing process and the MeeGo community device program as well as other key community initiatives. Founder of Maemo Greeters and MeeGo Greeters, successful community self-help programs. Manages MeeGo Network DFW.
Writer for Tabula Crypticum on “best practices, random analyses and sober speculation”, the Intel AppUp community and MeeGo Community Office.
Texrat | 28 March, 2011 21:18
Friend and Maemo/MeeGo bugjar master Stephen Gadsby alerted twitterites yesterday to a Fedora bugzilla flamefest, and at first blush it made for interesting comic relief. Who doesn’t enjoy a good Internet argument?
But a second read sobered me up quickly. The bug turned out to be an issue introduced into the crucial (and occasionally controversial) glibc code library that doesn’t appear to have been sufficiently regression-tested. The code change reason is described as an execution speed improvement, but it appears to have come at the expense of pre-emptive error-checking.
Most people aren’t going to care about the technical reasons underlying the discovered bug. Most will, instead, be concerned with its impact. And that gets us to the reason behind me writing today.
The first known manifestation appears to have shown up on a Flash-powered website (save that information for later; we’ll get back to it). Distorted audio was noted. Long story short, impressive detective work on the part of the Red Hat Linux community narrowed the cause down to an efficiency improvement in glibc that had the unfortunate side-effect of corrupting system memory. Further investigation proved the problem did not exist in Fedora 13 but is blatantly apparent in the Fedora 14 release.
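For readers who do care about the technical side, the core of it fits in a few lines. The C standard leaves memcpy() undefined when the source and destination regions overlap, while memmove() is required to handle overlap correctly. Below is a minimal sketch of that distinction, not glibc’s actual code: ranges_overlap() and careful_copy() are hypothetical helper names of my own, and the pointer comparison assumes a flat address space such as you find on mainstream desktop platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of glibc): returns 1 if the n-byte regions
 * starting at dst and src share at least one byte, 0 otherwise.
 * Comparing addresses via uintptr_t assumes a flat address space. */
int ranges_overlap(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if (n == 0)
        return 0;
    return (d < s) ? (s - d < n) : (d - s < n);
}

/* Hypothetical wrapper: always copies with memmove(), which is defined for
 * overlapping regions, and reports whether the call would have been an
 * invalid memcpy(). The return value is 1 for overlap, 0 otherwise, so a
 * caller can log the offending call site and fix it later. */
int careful_copy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    int overlapped = ranges_overlap(dst, src, n);
    memmove(dst, src, n);
    return overlapped;
}
```

With a buffer holding "abcdefgh", calling careful_copy(buf + 2, buf, 6) returns 1 and leaves "ababcdef" in the buffer: the user gets the correct result, and the developer still gets a signal that the call needs fixing.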
What’s particularly noteworthy in the case of this bug is the participation of Linus Torvalds, famous father of Linux. Linus’ involvement in this bug report escalated when certain community members defended the rationale behind the recent glibc code change, and hyperfocused on details that are either irrelevant (e.g., the proprietary nature of Adobe’s Flash technology) or lacking in critical context. What followed was the typical aggressive exchange common on the internet when two sides are each right in their own way, but one fails to recognize the bigger picture.
The big picture in this sense has users in it.
Linus very appropriately identified the problem:
I’d personally suggest that glibc just alias memcpy() to memmove().
Yes, clearly overlapping memcpy’s are invalid code, but equally clearly they do happen. And from a user perspective, what’s the advantage of doing the wrong thing when you could just do the right thing? Optimize the hell out of memmove(), by all means.
Of course, it would be even better if the distro also made it easy for developers to see the bad memcpy’s, so that they can fix their apps. Even if they’d _work_ fine (due to the memcpy just doing the RightThing(tm)), fixing the app is also the right thing to do, and this would just make Fedora and glibc look good.
Rather than make it look bad in the eyes of users who really don’t care _why_ flash doesn’t work, they just see it not working right.
There is no advantage to being just difficult and saying “that app does something that it shouldn’t do, so who cares?”. That’s not going to help the _user_, is it?
And what was the point of making a distro again? Was it to teach everybody a lesson, or was it to give the user a nice experience?
That last rhetorical question is key here. Purists defend the recent glibc changes, regardless of detrimental impact, on the basis of the ostensible speed improvements, and claim that it is up to application developers, such as Adobe’s, to exercise the due diligence necessary to prevent memory corruption. But such defenses blithely ignore the responsibility of upstream developers to implement reasonable safeguards, and even more importantly, the entire raison d’être of software in the first place:
To solve a problem for users.
Along with cohorts Dan Leinir Turthra Jensen and Timo Härkönen, I covered this topic tongue-in-cheek in a presentation at the inaugural MeeGo Conference in 2010. But the lighthearted approach doesn’t take away the seriousness of the subject. When I ask “who are we coding for?”, I believe I’m in the same ballpark as Linus Torvalds. After all, if execution speed at any cost is the goal, let’s strip out all error checking from core code libraries and let downstream developers worry about the consequences. Right? Think of the “wasted” clock cycles we could get back!
But in all seriousness, we don’t code in a vacuum. Our work has consequences. As developers, upstream, downstream or at any point along the solution continuum, we need to exercise a practical, reasonable responsibility to protect users from software mischief. Of course, we will still disagree on what is reasonable at times, but as Linus points out, simply adding users into the equation should resolve that dilemma in most cases. How useful is it to code from ivory towers?
It’s difficult to do full justice to the discussion that inspired this post. There is a great deal of history and bias involved, and both keep pulling the discussion into old, festering tangents. And I’m certainly not trying to demonize one side in the debate, or trivialize the validity of any fact-based points. But I believe the quick and detail-focused defense of the change and its risks is disingenuous, and exposes a flawed process. Even worse, I believe that embedded in and underlying those defenses is the idealist thinking that marginalizes Linux as a “geeks only” operating system.
Who are we coding for?
The needs and expectations of users must be a key part of any solution development process, and indeed I highly recommend that average users be involved to some extent in regression testing. I have found my best testers to fit into one or both of two categories: people willfully trying to break things, and/or those who are unfamiliar with the application or its ecosystem. Satisfy those two classes of users, and odds are you’re putting out a fault-tolerant product… at the very least.