Reclaiming Cyberspace

Governments of the Industrial World, you weary giants of flesh and steel, I come from Cyberspace, the new home of Mind. On behalf of the future, I ask you of the past to leave us alone. You are not welcome among us. You have no sovereignty where we gather.

- A Declaration of the Independence of Cyberspace (1996)

Cyberspace is the “place” where a telephone conversation appears to occur. Not inside your actual phone, the plastic device on your desk. Not inside the other person’s phone, in some other city. The place between the phones. The indefinite place out there, where the two of you, two human beings, actually meet and communicate. Although it is not exactly “real,” “cyberspace” is a genuine place. Things happen there that have very genuine consequences. This “place” is not “real,” but it is serious, it is earnest. Tens of thousands of people have dedicated their lives to it, to the public service of public communication by wire and electronics.

- The Hacker Crackdown (1992)

At first, no one really knew what the internet was for. We knew what it was, and people were coming up with all sorts of uses for it – bulletin boards, instant messaging, mailing lists – but there was this general sense that the internet’s potential went far beyond any of that. Bigger things were coming; it’s just that no one quite knew what they would look like. To capture this sense of excited uncertainty, sci-fi authors of the ’80s and ’90s leaned heavily on the idea of cyberspace.¹

As a metaphor, cyberspace was popular because it captured the sheer breadth of possibilities that exist with networked technology. By suggesting an analogue to physical space, which has been the venue for the entire range of analog human experience, cyberspace seemed to suggest a limitless new domain: new experiences, new possibilities, new potentialities. It’s an exciting thought. Fast-forward to 2019, and we’ve fallen a bit short of expectations. Many people’s experience of the modern internet consists of four social media sites, each of which is largely comprised of screenshots from the other three.

OK, maybe that’s a bit much. But grant me that bit of hyperbole and in return I’ll let you in on a secret: cyberspace hasn’t gone away. We’re still in it, we’ve just been spending all our time in the wrong parts of it.

More specifically: we’re spending our time in (cyber-)spaces that are privately owned. They’re private property, controlled and run by large companies who generally stay in business by treating their userbase as a product. Their business model is fundamentally abusive: their goal is to try and take advantage of you as much as they can without you leaving (or, ideally, even noticing). So much of the modern internet consists of these spaces that their downsides have come to be seen largely as downsides of the internet itself. This is an easy mindset to fall into, but it is nonetheless mistaken. The issues of private ownership can be avoided by building public spaces on the internet – shared experiences mediated by systems with shared ownership. Networks of volunteers coming together to create entire new categories of common goods. Public cyberspace.

Previous attempts at building such spaces (albeit under different names) have encountered seemingly insurmountable challenges. However, recent research² has offered new hope for overcoming these. If successful, the benefits of the resulting public cyberspace over our current systems would be significant. Let me explain.

Architecturally, the modern web is built around a client-server model: data requesters (like you, loading up Twitter) and data providers (like Twitter, sending you fresh tweets). This data lives on servers; the servers live in some large air-conditioned room with armed guards; these servers dole out data to clients like yourself at-will.

What sort of cyberspace have we architected here? Who, if anyone, owns it? Does the service’s owner have power over you? Certainly they do, and certainly they own the space – though most major companies go to great pains to hide this fact.

How do they hide it? With the same tricks that have worked for generations. Early on in The Hacker Crackdown, while chronicling the development of the phone network, Bruce Sterling recounts how, in the late 1800s, telephone technology was first normalized:

Contemporary press reports of the stage debut of the telephone showed pleased astonishment mixed with considerable dread. Bell’s stage telephone was a large wooden box with a crude speaker-nozzle, the whole contraption about the size and shape of an overgrown Brownie camera. Its buzzing steel soundplate, pumped up by powerful electromagnets, was loud enough to fill an auditorium. Bell’s assistant Mr. Watson, who could manage on the keyboards fairly well, kicked in by playing the organ from distant rooms, and, later, distant cities. This feat was considered marvellous, but very eerie indeed.

… After a year or so, Alexander Graham Bell and his capitalist backers concluded that eerie music piped from nineteenth-century cyberspace was not the real selling point of his invention. Instead, the telephone was about speech – individual, personal speech, the human voice, human conversation and human interaction. The telephone was not to be managed from any centralized broadcast center. It was to be a personal, intimate technology.

When you picked up a telephone, you were not absorbing the cold output of a machine – you were speaking to another human being. Once people realized this, their instinctive dread of the telephone as an eerie, unnatural device, swiftly vanished. A “telephone call” was not a “call” from a “telephone” itself, but a call from another human being, someone you would generally know and recognize. The real point was not what the machine could do for you (or to you), but what you yourself, a person and citizen, could do through the machine.

Modern social media sites have learned this lesson well: while the individual interactions taking place over the web are always between you and various servers, the perceived interactions are between you and your friends³ (or enemies, if you’re that particular sad sort of person). It’s easy to forget that everything you do is being proxied through these third-party servers. This humanization of the technology obscures an important fact: social media services’ interposition into your interactions with your friends effectively places these interactions on private property belonging to not to you or your friends but to the service you are using.

Now, there are some big advantages to building web sites with this centralized, business-run, private-property model. We’ve spent few decades doing things this way, and by now we’ve gotten pretty good at it. There are lots of centralized web sites that we don’t yet know how to build any other way.

That said, there are also significant drawbacks to this model. Servers aren’t cheap. Most web sites have needed to make money to stay online, which turns out to be pretty difficult. Hence the advent of online advertising, which has since been described (by people who would know) as the original sin of the internet.

Decades down the road, we’ve discovered that the digital ad ecosystem is just as bad as any other funding model we’ve tried – it just has a less obvious failure state. However, the lifeblood of this ecosystem – tracking data, the more personal the better – has turned out to be tremendously valuable to a lot of companies for a lot of reasons.⁴

And so, even as ad-based revenue models are looking more and more like a dead-end, the tactics used to deliver targeted advertisements have turned out to stand on their own as a revenue source. They are goldmines of incredibly personal information. This information is readily bought and sold, and – needless to say – the groups buying information about you rarely have your best interests in mind.

To wit, this information is frequently used to slash credit and deny loans, to raise interest premiums, to enable hate speech, to single out psychologically vulnerable individuals for manipulation,⁵ to enable domestic violence… The list goes on and on. When handled sloppily, this data can – and has – outed pseudonymous performers, members of anonymous recovery groups, recipients of mental health care services, and more. The chilling effect of pervasive surveillance on civil society and on civil dissent in particular⁶ is also well-documented, and has been for quite some time. This surfeit of data can cause, and has caused, immeasurable actual harm to actual people.

All this is repulsive beyond words. It is a fundamental breach of human trust, committed casually and constantly. But of course, the market doesn’t care about that: companies have a profit incentive to collect and monetize this data, so collect and monetize they do. And as long as we’re spending all our time in their private corners of cyberspace, there’s nothing we can do about it. It’s like we’re carrying out our entire social lives in a mall’s highly surveilled food court, with the mall transcribing all our conversations and selling them to the highest bidder.⁷

One response to this unintended consequence of centralized systems has been to try to decentralize by building new, federated⁸ systems for web sites (with special focus on social media sites, since in this day and age social media is where most of us spend most of our time online, as well as where some of the most openly user-hostile companies are).

The idea is this: instead of having a service provided by a big, monolithic server (or set of servers) owned and operated by one company, you have a number of smaller, volunteer-run servers, likely (though not necessarily) run out of different physical locations, which collaborate to provide a similar experience to centralized web sites. Distributing responsibility in this way also distributes ownership (somewhat!), limiting the amount of data that can be collected on users.⁹

The move from centralized to federated sites brings some welcome changes. One can have a somewhat greater degree of confidence that the platform is respecting one’s privacy. This is like swearing off the food court and hanging out at someone’s house instead: It’s chill, as long as they’re chill. But you’re still on someone’s private property, so you’re still placing trust in someone, and things can still go very wrong.

Perhaps the most popular federated social media site in 2019 is Mastodon, which is actually doing pretty well as these things go but still has a long way to go before they can claim serious adoption (and they might want to find a better word than ‘toot’ for their version of tweets – but then again, what do I know).

With all federated services, you make a number of compromises. You lose the certainty of a monolithic corporation data-mining everything you do or say through them, but you pick up the risk of random individual strangers trying to do the same. You also place your trust in these strangers’ abilities as sysadmins, which is a lot of responsibility to leave on an unpaid stranger’s shoulders, especially long-term.

You also lose the skilled, well-paid security teams of these large companies – teams who work hard every day to protect your data by making sure it gets leaked only to those who are paying for it and no one else. In their place you have some poor specific individual sucker, and you just really hope they’re doing their best, even though they’re defintely not getting paid enough for this shit.¹⁰

Overall, federation represents an improvement on centralization, but it carries over a number of centralization’s problems, a few of which it makes much worse. The story of federated systems is in many ways the story of centralized systems, but writ small and heavily repeated.

If we want to get away from this set of issues, perhaps it makes sense to get away from centralization entirely. The polar opposite of centralized systems are distributed systems. These are peer-to-peer networks that provide services through the power of mutual aid. Every peer shares a small amount of responsibility for the system’s integrity. They’re a little bit like employee-run companies that anyone can join.

When built well,¹¹ these systems are surprisingly robust and can create reliable services that somehow exist between the peers involved in maintaining them – carving out what might reasonably be considered public cyberspace.

What does public cyberspace look like? Well, perhaps the most famous example of a distributed, peer-to-peer system is BitTorrent, which was recently reported to constitute 22% of all internet traffic globally (for scale, Netflix claims to be responsible for 15%). BitTorrent is perhaps best known for its popularity among the pirating community (that 22% stat has been known to spike by up to ten percentage points during Game of Thrones season) but it is also used to speed up downloads for video game installations and updates, for open source software’s installers, for quickly mirroring data from vital services like The Internet Archive, and much more. Rumor has it that Facebook also uses BitTorrent internally for moving large files around.

From a design perspective, the premise of BitTorrent is this: you break large files up into small chunks, and you download a bunch of these chunks at a time in parallel from a bunch of different people. Once you’ve downloaded part of a file, you can start sending those parts to other downloaders, and perhaps they will reciprocate (or perhaps they won’t – and that’s ok, because there are enough generous uploaders that a few unhelpful peers don’t spoil the thing for everyone). In the end, as long as the honest peers in the network collectively possesses all parts of the file, everyone will be able to download the whole thing – and the whole process is usually a lot faster and more fault-tolerant than it would be if you were downloading from a single source (like, say, a web server).

To echo an earlier question: what sort of cyberspace are you taking part in when you join a BitTorrent peer swarm? Does anyone own this space? The network itself is totally ad-hoc (although your introduction point to it might not be). You have no client-server relationships, only peer-to-peer. There are no mediators; no social-media giants are interposing themselves into your interactions. This is not to say you’re completely secure. You still place some small amount of trust in your peers. However, the range of misbehavior that peers are capable of is much, much more limited than what a corporation¹² can and will do.

Of course, BitTorrent is a pretty limited application. It does one thing, and does it well – but it only does one thing. What else are these peer-to-peer systems capable of? In spite of decades of research, that is still something of an open question in 2019, largely because there doesn’t seem to be a lot of money to be made by answering it. Nevertheless, it’s an important question and one that I’d very much like to see more exploration of. Much of the untapped potential implied by the idea of “cyberspace” lies in peer-to-peer systems. We’ll go into more depth on this later.

Returning for a moment to the example of BitTorrent: I mentioned earlier that one sticking point with BitTorrent is getting an introduction to a peer swarm. This turns out to be a challenge shared by almost all peer-to-peer systems: Once you’ve met some peers, you’re set, but getting to that point can be tricky. How is it done today?

The “traditional” way to get started is via a peer tracker, a centralized service that peers can register themselves with and which in turn serves up a list of registered peers on demand. Not only is this a central point of surveillance, it is a central point of failure.

A step beyond running a peer tracker is to offer mirrors, which are essentially a federation of the tracker service. In this strategy, a bunch of folks grab the same content index and start serving up their own trackers from it, essentially forcing copyright regulators and their ilk to play whack-a-mole to try to take everyone down. This is rarely a permanent solution, but at the very least it can buy some time.

A step further is to use magnet links. The way these work, frankly, is one step removed from magic. A magnet link is a small string, short enough to easily copy-paste or even transcribe by hand (if you’re a good typist). This string defines an address where one can look to get details of a torrent, and a list of peers for that torrent. Where do you look for this data? Not on a web site, but rather in something called a distributed hash table or DHT.

A DHT is essentially a very simple database that a bunch of peers collectively maintain. It works off a peer-to-peer protocol, much like BitTorrent does, but this protocol offers something different. It lets you store small chunks of data (like the contact info other peers need to connect to you) at addresses (like the one you get from your magnet link), and it lets you query any address to see all the data everyone has stored there. Anyone can connect to it, anyone can store data on it, and anyone can get data from it. Some people maintain very long-lived presences in the network, serving as reliable points of reintroduction, and trackers for DHT peer swarms exist as well. This is where your computer goes to resolve magnet links, and it works, and honestly the whole thing seems almost too good to be true.

To return to our earlier question: what sort of (cyber-)space have we architected here? Like I said, I’m inclined to look at it as public cyberspace. It’s like a public park or a library: something which exists because a small number of people care enough to look after it, and which therefore is available for a huge number of people to use and enjoy, free of cost. No one owns it – we just coexist within it.

There’s a reason this seems almost too good to be true: it almost is. Malicious interference in peer-to-peer systems turns out to often be very easy, and defending against this is often nontrivial. BitTorrent protects itself by using lots of integrity checks – as long as you have a torrent file describing the file you’re trying to download, you’ll be able to validate every chunk of it and keep your peers honest – but with a construct like a DHT, the contents of which are constantly changing, there is no such single source of truth. As a result, it turns out that most real-world DHTs are very easy to mess with, in large part because no one has a full view of the system.

The Sybil attack consists of deploying a huge number of peers on the network – peers which you control, but which look like distinct entities to anyone else. This gives you a lot of leverage, and while each DHT protocol fails differently, they all do fail sooner or later. It has been shown that perfect defense against Sybil attacks is practically impossible (PDF link). Then again, if we’re keeping score, it could be argued that perfect defense against standard threats for a large corporation is practically impossible too. The DHT that most magnet link resolutions go through, Mainline DHT, makes no attempt at defending against Sybil attacks, and so it experiences many of them each day (PDF link). These attacks are easy to carry out and have the potential to completely and arbitrarily censor any data from MLDHT.

While “perfect” defense against Sybil attacks may not be possible, effective mitigations do exist. The Theseus DHT project, which I have been working on nonstop since 2016, exists to explore these. The approach taken is much more analytic than most previous work, and the results this project has achieved so far are very encouraging.¹³

A construct like Theseus DHT, if successfully realized, would provide a tangibly useful service. It would be able to support many sorts of ephemeral data sharing and storage, serving as a data layer for entirely new app architectures. As for what types of apps Theseus DHT could support, the sky’s the limit. Here are some examples.

The project’s namesake and original motivation, Theseus, a planned service for sharing, storing, and searching research, the design of which is described in some depth here.¹⁴ From that post: Imagine a system with the reference features and massive scope of the modern scientific paper repositories – arXiv, Embase, et al. – but completely decentralized and peer-to-peer. The processing power required for all of the system’s features, including searching for and downloading papers or data, would be shared across all the network’s participants. Anyone with an internet connection could join and use a friendly web interface to immediately access anything stored in the system.

How about secure chat app with no central list of users, where identities are established via public keys and all messages are encrypted? The message transport could be built using existing protocols like OTR or Signal, with peer introductions handled using the ideas laid out here (or in any number of other ways).

As for social media, one common complaint about Twitter is that old tweets are rarely more than a liability, and many services exist for automatically deleting them. Newer social media and chat apps typically build in ephemeral messaging by default (think Snapchat, or Signal’s expiring texts). To that end, a Twitter-like app with ephemeral tweets stored directly on the DHT would be trivially straightforward to build.

Other, more complex forms of social media could be supported as well, although these would be significant projects in their own right. If this sounds interesting to you, hit me up for an invitation to the Theseus DHT Slack channel¹⁵ and we can brainstorm.

It would be straightforward to build a distributed mirror of the Internet Archive which dynamically retrieves content based on the torrents they publish. Of course, for the system to be fully decentralized, an index of these torrents would have to be stored in the DHT as well, but that shouldn’t be too difficult.

As for more ambitious ideas: how about automatic mirroring for sensitive data (e.g. with people volunteering to mirror encrypted copies of data leaks, to ensure journalists can maintain access to them even under extreme circumstances) – a service that would use largely the same protocols as Theseus, and which could conceivably even be integrated with systems like SecureDrop to automatically back up encrypted copies of anything submitted to a newsroom as soon as it arrives.¹⁶

Theseus DHT is also a natural fit for making introductions to peer swarms. The fact that Theseus DHT is designed for resilience against Sybil attacks makes it preferable over MLDHT for this purpose, as does the fact that it uses strong, carefully-engineered encryption to help protect peers’ privacy even under pervasive surveillance.

What this means is that a system like Theseus DHT would be not only a powerful construct in its own right but also an ideal bootstrapping point for the next generation of peer-to-peer apps – a layer on which they can build, a substrate, a common good commonly used and maintained, with common benefit. Of course, all this depends on developers recognizing the potential that exists here and deciding to help develop this corner of cyberspace. That’s why I’m not necessarily saying this is the future. What I’m saying is: We should be so lucky as for this to be the future.

Style note: Throughout this post, “cyberspace” refers to the word cyberspace when italicized, and to the concept of cyberspace otherwise. ↩
Some of which has been carried out by myself through the Theseus project, which you’ll hear more about later. ↩
For instance, suppose you post a tweet, and a friend of yours retweets it. Technically, the interactions here are: you contact Twitter’s servers and ask them to store a new tweet from your account on your behalf; Twitter accepts your request; your friend contacts Twitter’s servers, asking for fresh tweets; Twitter accepts your friend’s request and shows them your tweet; your friend contacts Twitter’s servers again, asking to retweet your tweet; Twitter accepts this request as well; soon after, your Twitter app contacts Twitter’s servers of its own volition on your behalf, asking for new notifications; Twitter responds with the news that you have been retweeted. Note how you and your friend do not at any point interact directly over the network, even though the perceived interaction is between you and them (to wit: you post a tweet; your friend retweets it; you see that your friend has retweeted it). ↩
For more on this, I cannot recommend Bruce Schneier’s book Data and Goliath highly enough. I’ve written about that book previously; you can see what I had to say here. Long story short: it covers an incredibly important subject, makes explosive claims from start to finish, and backs them up at the end with more than 100 pages of citations. You can probably find or request a copy at your local library. ↩
From that article: Facebook isn’t a mind-control ray. It’s a tool for finding people who possess uncommon, hard-to-locate traits, whether that’s “person thinking of buying a new refrigerator,” “person with the same rare disease as you,” or “person who might participate in a genocidal pogrom,” and then pitching them on a nice side-by-side or some tiki torches, while showing them social proof of the desirability of their course of action, in the form of other people (or bots) that are doing the same thing, so they feel like they’re part of a crowd. ↩
From Data and Goliath (Schneier, 2015, page 115): These chilling effects are especially damaging to political discourse. There is value in dissent. And, perversely, there ca be value in lawbreaking. These are both ways we improve as a society. Ubiquitous mass surveillance is the enemy of democracy, liberty, freedom, and progress. Defending this assertion requires a subtle argument – something I wrote about in my previous book Liars and Outliers – but it’s vitally important to society. Think about it this way. Across the US, states are on the verge of reversing decades-old laws about homosexual relationships and marijuana use. If the old laws could have been perfectly enforced through surveillance, society would never have reached the point where the majority of citizens thought those things were okay. There has to be a period where they are still illegal yet incresaingly tolerated, so that people can look around and say, “You know, that wasn’t so bad.” Yes, the process can take decades, but it’s a process that can’t happen without lawbreaking. Frank Zappa said something similar in 1971: “Without deviation from the norm, progress is not possible.” ↩
Or, in Facebook’s case, selling them to just about anyone who bothers to put in a bid. ↩
The term derives from the concept of federation in political science, where it signifies the formation of a political unity, with a central government, by a number of separate states, each of which retains control of its own internal affairs. (source) ↩
It is further assumed that altruistic volunteers are less likely to indulge in invasive tracking, as well as perhaps less technically capable of implementing it. ↩
In fact, usually they’re not getting paid at all. ↩
The details vary from system to system, but usually “built well” entails building in thorough error-checking, lots of redundancies, or both. ↩
(or even, in some cases, a federated system node’s maintainer) ↩
In short, a combination of tactics is used to make Sybil attacks expensive to perform, easy to detect, and easy to compensate for (once detected). The network is also designed to admit straightforward mathematical analysis, which allows for formal validation of each of the aforementioned tactics’ effectiveness. Technical write-ups are forthcoming. Hit me up on Twitter (my DMs are open) if you’d like a peek at the drafts :) ↩
This project, which places an unusually strong emphasis on high availability and resistance to censorship, was originally motivated by explicit threats of censorship of climate research made by the Trump EPA. Ideologically, it has much in common with existing systems like Sci-Hub. ↩
Anyone is welcome, but I’m not sharing the link publicly quite yet, because (at least for now) I want to keep the group small enough that everyone can know everyone. ↩
In terms of security, this would be something of a trade-off, as it could allow interested parties to monitor SecureDrop activity indirectly, and could result in ongoing data disclosure from one-time key compromise. Proper implementation would require regular key rotation, frequent “junk” uploads to obfuscate legitimate traffic, extremely careful key management, and more. That said, if such a system were properly implemented, it would offer tremendous data resilience even against powerful adversaries. ↩

Theseus Protocol v1.0 Overview

I just published Version 1.0 of the Theseus DHT Protocol specification. This is especially exciting because it went live on Friday evening, April 20th – which just happens to be the one-year anniversary of the version 0.1 spec’s publication.

The 1.0 spec contains a number of important changes, and reflects a lot of conceptual and material improvements to the design that I’ve been able to come up with over the last year. I don’t expect to have to make any more changes to the spec before finishing a 1.0 version of the actual implementation.

There’s still a long way to go as far as implementing everything described here, but publishing an authoritative spec for reference is a big step forward and I’m really excited about how far this project has come and where it’s headed.

Hello GitHub Pages!

As you can tell if you’re reading this post, I’ve decided to migrate my blog from Blogger to Github Pages.

Blogger’s web editing interface is a pain. Tweaking your blog’s theme is a nightmare. I also dislike relying so heavily on a Google-hosted platform.

Github Pages, on the other hand, offers super easy styling and customization. You can even write posts in Markdown, using whatever editor you prefer. Github seems to be a great and reliable host, too.

Spread the word: The new home of Sohliloquies is here at https://wootfish.github.io/sohliloquies/

Net Neutrality and the Theseus DHT

The design for Theseus, as it stands, consists of a distributed hash table (DHT), a torrent client, a distributed search algorithm, and a web-based UI. Until now, I'd been developing everything as one monolithic entity. The purpose of this blog post is to make two announcements: first, that I'm going to be breaking out the Theseus DHT into a stand-alone software package; second, that I'm setting a hard release date for this package: New Year's Eve.

I'd also like to talk a little about the motivations behind this decision.

Book Notes

Post-graduation, I've been finding it easier to carve out free time. Some of this I've spent on Theseus and other projects. More of it I've spent reading. It's important, I think, for technical people to retain some analog hobbies.

These are notes on some books I've read lately. Not everything -- just the stuff I've liked.

Previously:

Theseus: A Robust System for Preserving and Sharing Research

Resisting Sybil Attacks in Distributed Hash Tables

Securely Finding Friends via DHT Dead Drops

Distributed Search in Theseus

The State of Theseus, One Month In

Theseus Protocol v0.1 Overview

Bloom Filter Parameters for Distributed Search

Message Encryption in Theseus

In the previous post, Message Encryption in Theseus, I outlined how on top of robust encryption, Theseus can handle optional public-key authentication. This authentication gives us a way of defeating man-in-the-middle (MitM) attacks, but only if at least one of the communicating peers trusts a key possessed by the other. Since most interactions in peer-to-peer networks take place between total strangers, this limitation carries with it some unpleasant drawbacks.

Previously:

Theseus: A Robust System for Preserving and Sharing Research

Resisting Sybil Attacks in Distributed Hash Tables

Securely Finding Friends via DHT Dead Drops

Distributed Search in Theseus

The State of Theseus, One Month In

Theseus Protocol v0.1 Overview

Bloom Filter Parameters for Distributed Search

Brief Summary

Theseus will use the Noise Protocol Framework.

This will allow all network traffic to be encrypted, with flexible authentication and many other desirable properties as detailed below.

This post explicitly obsoletes the Theseus Protocol v0.1 Overview. That overview deliberately lacked specifics on message encryption. It was expected that said specifics would necessitate some changes, and this has indeed proven to be the case. A v0.2 specification is forthcoming.

Previously:

Theseus: A Robust System for Preserving and Sharing Research

Resisting Sybil Attacks in Distributed Hash Tables

Securely Finding Friends via DHT Dead Drops

Distributed Search in Theseus

The State of Theseus, One Month In

Theseus Protocol v0.1 Overview

Bloom filters are central to both the routing and content-indexing functions of Theseus's distributed search algorithm. A filter's false-positive probability increases monotonically as a function of the number of keys in the filter, as well as on two pre-set parameters: the filter size and the number of hash functions used. The expected number of bits set to 1 in the filter also increases monotonically as a function of n. Derivations of the equations governing these values can be found in section II.A of this paper, which also provides a treatment of the idea of compressing Bloom filters in certain situations, e.g. for network transfer or bulk storage. The question is asked of whether a benefit can be obtained from this compression, and the answer is that it of course depends on the sparsity of the filter.

The strategies suggested by that paper for parameterizing a Bloom filter with compression in mind involve significantly increasing the filter's uncompressed size in memory. This is a fairly unattractive cost.

Previously:

Theseus: A Robust System for Preserving and Sharing Research

Resisting Sybil Attacks in Distributed Hash Tables

Securely Finding Friends via DHT Dead Drops

Distributed Search in Theseus

The State of Theseus, One Month In

Introduction

Hope everyone's having a relaxing day. Things have been a bit quiet over here lately, as I've been busy sampling the boundless delights of graduate life. But fear not: progress on Theseus is still being made. I've been hard at work on the data storage formats described in the previous post's TODO list, and on the specifics of the Theseus protocol itself.

These elements of the design are all interrelated, so it seems appropriate to summarize and discuss them as a unit. That means we've got a lot to cover here, so we're going to break description and discussion into separate sections and make the description section deliberately exposition-light.

Theseus is growing more realizable by the day. Soon, interface design, not protocol details, will end up being the project's top priority. That'll be an exciting shift and I'm looking forward to finding people to collaborate with on it.

For now, though... Let's get technical.

The State of Theseus, One Month In

For just over a month now I've been working nonstop on one part after another of my current pet project, Theseus. Things are coming along very well so far -- in fact, I've surprised myself with the amount of progress I've been able to make in such a short time.

That said, I have yet to write a single line of code. This is because I'm trying to nail down a complete specification of the system first. The system has many moving parts, some of which possess complex interdependencies, and it's important to try to get this stuff right the first time.

The goal of this post is to take inventory of what's been done so far and to list out everything that's still needed in order to reach a point where work on an implementation can begin.

If you'd like to collaborate on Theseus, I'm hoping this post will give you an idea of where your contributions would be particularly valuable. If you're realizing at this point that you're not even sure what Theseus is, I've got you covered: The beginning of my introductory post, Theseus: A Robust System for Preserving and Sharing Research, can take care of that. I'll be directly referencing that post throughout the discussion below.

Previously:

Theseus: A Robust System for Preserving and Sharing Research

Resisting Sybil Attacks in Distributed Hash Tables

Securely Finding Friends via DHT Dead Drops

Search algorithms in peer-to-peer networks have a long, storied, and rather unhappy history. Search sounds like a straightforward problem, but in fact it turns out to be alarmingly difficult even in non-adversarial settings. In realistic models which assume the presence of some hostile peers, the situation is even worse.

It has taken me a while to sit down and write this post. For that, there are two reasons. First off, I've been pretty busy: on top of keeping up with my job, last week was my last finals week before graduating with my Bachelor's, yesterday was commencement, and next weekend is PRCCDC. So I've been spread somewhat thin. The second reason is that search is a very hard problem, and I've been holding off until I'm confident I've got something worth sharing. Luckily, after making some major headway in the last couple weeks, I think I have just that.

Previously:

Theseus: A Robust System for Preserving and Sharing Research

Resisting Sybil Attacks in Distributed Hash Tables

This won't be a long one -- I just wanted to take a few spare moments to jot down an idea for how friends in a Theseus-like network who know each other only by public key but don't have each other's network contact info could first get in touch.

I believe it's possible to do this without anyone else finding that contact info, deducing the social connection, or even knowing that either involved party is trying to get in touch with anyone. Those guarantees would be tricky to establish through naive use of public-key cryptography, but I believe this Diffie-Hellman based scheme can offer all that and more without breaking a sweat.

Previously: Theseus: A Robust System for Preserving and Sharing Research

Note: While just about all of the ideas expressed in this blog post live on in my more recent research on this subject, a number of the specifics given in this post are obsolete. In particular, the mathematical analysis is flawed and overly simplistic. This post is left up for posterity, but it should not be treated as current or authoritative.

Introduction

I'm in the process of designing Theseus's file signature tracking system. A few good options are available here; the one I'm currently leaning towards is a Kademlia-based distributed hash table where the table keys are public keys and table values are lists of signatures by the keyed keys.

The main advantage of this approach is its simplicity. Simple is appealing. Another appealing factor is that our use case is very similar to the use case for BitTorrent's Mainline DHT (MLDHT), which was built to track lists of peers. This similarity means that if we model our approach to a degree on MLDHT then there is a wealth of directly applicable research for us to draw on.

The main disadvantage of this approach is that MLDHT turns out to have some serious security problems at the design level. It inherits these from Kademlia, on which it was based. These problems undermine the DHT's ability to store data reliably. Similar issues exist for virtually all DHT constructions and seem to be a large part of why DHTs aren't more widely used. In general these flaws are intractable; however, in the specific case of MLDHT/Kademlia I think they can be overcome.

The specific problem is that Kademlia is susceptible to various forms of the Sybil attack, a class of attack which plagues practically all distributed, decentralized systems. This issue has been known for a long time, and in spite of more than a decade of research on the subject, few if any of the proposed solutions are viable in practice.

Most published defenses fail in a way that falls into at least one of three categories: those that require abandoning decentralization and introducing a central identity-issuing authority; those that rely on some mathematical property available only in certain contexts; and those that simply do not work in the real world. A number of papers suggest establishing identity using social network data (sacrificing anonymity in the process). Most of these papers are interesting only in that they provide examples of all three ways of failing simultaneously.

A few proposals rely on metrics derived from arcane and potentially volatile properties of the network. Some of these are extremely cool and clever, and would probably work in certain settings, but for our purposes they are nonetheless too complex and/or fragile to rely upon.

The 2002 paper which coined the term "Sybil attack" laid out some basic proofs about the ability of these attacks to persist in decentralized peer-to-peer networks even when the protocol includes extreme defensive measures. Most later attempts to introduce identity management through techniques like e.g. computational proof of work end up failing in exactly the ways these proofs predicted. There seem to be essentially no bulletproof techniques for preventing Sybil attacks, and it is perhaps telling that the BitTorrent swarm hasn't even tried to deploy defenses here.

Trying to get rid of Sybil attacks while maintaining strong guarantees of anonymity and decentralization seems like a lost cause. The story is different, though, for identifying and resisting these attacks. Perhaps it would make sense to just accept the possibility of deploying Sybils on the network and, rather than trying to get rid of them, simply try to design the network in ways that make it easy to respond to them and limit the damage they can cause.

What follows are some ideas for how this could be done in the specific context of a Kademlia-like distributed hash table. My hope is to outline some ways that small additions to the protocol could dramatically increase its resilience without compromising any of its most appealing features.

Apologies in advance for the length of this post. I'm juggling this project with work and my last quarter of undergrad, so I haven't been able to spare the time to write anything shorter.

Suddenly, in the space of a few weeks, I discovered what appeared to be definitive answers to the problems which had baffled me for years … I went about saying to myself that now at last I had done something worth doing, and I had the feeling that I must be careful not to be run over in the street before I had written it down.

-The Autobiography of Bertrand Russell

The Promise

Recently, American scientists have been engaged in an effort of unprecedented scope to back up as much climate data and research as they can get their hands on. In the hours surrounding the presidential inauguration, groups of activists rushed to back up this data on European servers. These groups fear censorship of climate research by the incoming administration, and rightly so. Currently, governmental censorship of science is more than a hypothetical concern.

It's inspiring to see this level of mobilization, but horrifying that it is necessary. The need for this action reveals that we have built an infrastructure that is vulnerable to attack at many levels. We have found that our top scientific institutions can be silenced if the administration dislikes their findings. We have found that certain political groups believe they can shape reality through misinformation and censorship. And we have found that we, the people, have very little in place that's equipped to push back.

It doesn't have to be this way.

Imagine a system with the reference features and massive scope of the modern scientific paper repositories -- arXiv, Embase, et al. -- but completely decentralized and peer-to-peer. The processing power required for all of the system's features, including searching for and downloading papers or data, would be shared across all the network's participants. Anyone with an internet connection could join and use a friendly web interface to immediately access anything stored in the system. They would also optionally be able to volunteer their own spare storage space, CPU power, or network bandwidth to help keep the system healthy.

This is very close to the model used by BitTorrent, and if BitTorrent's success is any indication, this model can lead not only to remarkable download speeds but also to resilience in the face of attempted censorship, even by state-level adversaries. No central server means no central point of failure.

What I'm laying out here is the outline of such a system for sharing important documents on any subject its users consider valuable. The system relies on its nodes to redundantly back up, index, enumerate, and supply the data it tracks. The system also includes support for searching for papers in a variety of ways. Every part of the system is designed with a strong, explicit focus on security and privacy.

I'm calling it Theseus.

If Freedom of Speech Is Your Best Defense, You Have a Problem

Let's keep this short and sweet. I love freedom of speech at least as much as the next person. Possibly more. Back in high school my civics teacher would, between classes, slip me publications on the First Amendment to study, because she'd noticed it was one of the only topics that consistently captured my interest.

How to Lose at Mancala (Consistently!)

Over winter break, I spent a few days fiddling with a Mancala AI. It came together surprisingly quickly! Here's a little journaling on that project.

Background: Mancala is a fun, simple board game with a surprisingly high skill ceiling. I've never been very good at the game, but I've had fun playing it anyway. After a recent series of ruinous defeats at the hands of my own family I started thinking about what optimal game strategies might look like.

(Side note: It turns out Mancala is a solved game (i.e. it's known how to play "perfectly") for the most common numbers of stones or houses. I haven't looked at any of the relevant publications, but my impression is that they relied mostly on endgame databases and exhaustive search. That's no fun, so I'm going to ignore those results and try to build this project from the ground up. The goal is to write a program that can beat me at Mancala, and hopefully to learn a few things about the game from it.)

If you don't know the rules of the game, here's a quick overview of the variant I'll be writing about. I'll be calling the game pieces "stones", and the little areas they sit in "houses".

I opted to do this all in C because I had a feeling I'd need speed here and because I wanted more practice with the language.

You can find the Github repo here.

Academic Computer Science Needs to Get Its Shit Together

The fact is, our beloved field of computer science has reached an embarrassing low. Among programmers in all but some collegiate circles, calling something "academic" is oblique shorthand for calling it overwrought, obscure, inflexible, and/or too fragile to be useful in the real world. And there's a good reason for this: more often than not, the products of academia in computer science meet all those criteria.

But why? Shouldn't universities be where our best and brightest gather? Isn't the whole idea of academia to draw great minds together so they can collaborate and educate?

The short answer: Well, yes, but that idea doesn't work so well when you're competing for talent against an industry that can afford to triple your salary offers. With few exceptions, industry gets who they want and academia gets stuck with the dregs -- and you can't do much with dregs.

What's the long answer? Glad you asked. Buckle up.

Virtualizing a Raspberry Pi with QEMU

A while ago, I wrote about building a rack for a Raspberry Pi cluster. If you have a rack, at some point you'll want to put some Pis on it. Virtualization can make the process of imaging these Pis relatively painless. You can generate custom images from the comfort of your desktop and even automate the whole process. Here's a quick crash course in virtualizing the Pi using QEMU.

All the guides I could find on this were at least a couple years old and were missing various important parts of the process. None of them quite worked out-of-the-box. It's hard to blame them, though, since a hardware platform like the Pi is of course a moving target. For what it's worth, this guide should be complete and up-to-date as of October 2016.

On Ceasing to Be Exceptional

Somewhere just north of a decade ago -- so when I was 10 or so -- I was helping strangers in online forums troubleshoot games they were making. It was crazy fun. I still remember this one guy: He had a 3D FPS he was building in Game Maker, a tool that (for this use case) gives you a graphics engine and not much else. He had mouselook set up (no trivial feat) and had a bare-bones firing system, but it only worked when the camera was leveled out flat. You couldn't shoot up or down. That's a big problem, but the game was still pretty fun in spite of it, so in the spirit of trying to make a good thing better I offered to help.

With the right background knowledge, the solution is pretty clear. The mouselook system gives us an angle for how far up or down we're looking. What we want is for bullets to rise or fall at a steady rate determined by this angle. So we need a value to determine that rate, and that value will end up being a slope. You can turn angles into slopes using trig: slope = rise/run = sin(θ)/cos(θ) = tan(θ). So you take this slope, scale it based on horizontal speed, and periodically add that to the Z coordinate.

This is easy to figure this out if you know the concepts involved, but take it from me: it's a lot trickier when you've never heard of a slope and all you know about trig is what Wikipedia tells you. But with a few days' work I figured it out, sent the guy a patched version of his game, and went back to playing it. All this work, and yet for the life of me I can't remember his reaction.

Book Notes

Over this past couple years I've been lucky enough to find time in between classes and research for a bit of pleasure-reading. I thought it might be fun to write a little on different books that've left strong impressions on me. It's nice to keep a record, but maybe I can also share some of the enjoyment I've gotten from them.

It was tempting to put the word "Reviews" somewhere in this headline, but I'm not really trying to just share ratings and little terse blurbs here. Ratings are subjective and reductive, so it's hard to see much point in cooking them up in this context. I'd rather just share a few thoughts and impressions for each book with the hope that you'll find them interesting.

A Great Paragraph From Infinite Jest

From the fictional filmography of James O. Incandenza (page 988):

Cage III--Free Show. B.S. Latrodectus Mactans Productions/Infernatron Animation Concepts, Canada. Cosgrove Watt, P. A. Heaven, Everard Maynell, Pam Heath; partial animation; 35 mm.; 65 minutes; black and white; sound. The figure of Death (Heath) presides over the front entrance of a carnival sideshow whose spectators watch performers undergo unspeakable degradations so grotesquely compelling that the spectators' eyes become larger and larger until the spectators themselves are transformed into gigantic eyeballs in chairs, while on the other side of the sideshow tent the figure of Life (Heaven) uses a megaphone to invite fairgoers to an exhibition in which, if the fairgoers consent to undergo unspeakable degradations, they can witness ordinary persons gradually turn into gigantic eyeballs. INTERLACE TELENT FEATURE CARTRIDGE #357-65-65

Posting an Infinite Jest excerpt to your blog is absurdly pretentious, I know. It's worth it to have the quote close at hand. Somehow, it's been coming up a lot lately.

More Politics in Software

Over the past two months, I've been writing a series of posts on the intersection of politics and technology. The series consists of two bookend posts, with a number of focused topic discussions between them; this is the second bookend post.

Programmers are incredibly good at finding stuff to get worked up about. What's your favorite text editor? Vim? emacs? Maybe (god help you) Notepad? gedit? kate? nano? Or maybe you don't use an editor -- ok, then what's your favorite IDE? Eclipse? Visual Studio? NetBeans? Something obscure and language-specific?

Speaking of, what's your favorite language? Python? C? Java? C++? C#? Javascript? Lisp? Haskell?

Astute readers may have picked up on a theme here: Unless you're getting ready to draft a specification or set up a group workflow, none of these questions matter at all. And yet, we're all expected to have strong opinions on them. Conversations like these cement computer science's male-dominated reputation, because they are all about unabashed dick-wavery.

You Can't Legislate Reality

For thousands of years, geometers tried in vain to square the circle -- a task which, in 1882, was mathematically proven to be impossible. A result like this isn't really something you get to debate the specifics of. They call it "proof" for a reason.

That's part of what made the 1897 proceedings of the Indiana General Assembly so bizarre -- because it was there that lawmakers tried to pass a law declaring the problem solved. The bill might well have been passed by the senate, were it not for the intervention of a visiting professor.

This incident is one instance of a theme which recurs whenever legislature collides with math or technology. The legal system just can't seem to wrap its head around how science works. Many are inclined to see malice in this tendency -- a sort of deliberate commitment to backwardness, a gleeful embracement of that which is known to be wrong. Tempting as this is, it's a good rule of thumb never to attribute to malice that which is adequately explained by stupidity.

What that rule of thumb fails to capture, though, is that many cases have plenty of room for stupidity and malice.

In the instance of Indiana's Pi law, the ignorance of certain groups within the legislature was maliciously exploited to feed the egotism of the bill's author, an amateur mathematician trying to make his reputation "solving" impossible problems.

In the instance of the Scopes trial, the scientific illiteracy of certain parties involved was exploited to the benefit of evangelical religious fundamentalists with a well-established track record of using legislation in legally dubious ways.

And in the instance of many recent legal cases concerning copyright, patent law, digital rights management (DRM), intellectual property, cryptography, and hardware design, the ignorance of the legislative and judicial systems on technical matters has been (and continues to be) exploited by avaricious and sometimes malicious vested interests in both government and industry, who use their leverage to advance profoundly antisocial ends.

Cory Doctorow argues compellingly in his recent book, Information Doesn't Want to be Free, that modern attempts at digital rights management (which he refers to using a more general term, "digital locks") are not only futile but also harmful to everyone involved. The essential problem (and here I do Doctorow a great disservice by trying to briefly summarize some of his main points; really, his treatment of the topic is second to none and I can't recommend that book highly enough) is this: What digital rights management schemes try to do is to provide a user with access to a technology, but only for certain purposes -- which, to put it bluntly, is just not possible.

Ignoring Abuse On Your Social Platform Is Not a Neutral Stance

There are some pretty big problems with social media right now. Or, it might be more accurate to say there's one big problem -- but it's really big. The problem is how, in this age, we deal with abuse and harassment online.

It borders on impossible to express the scope of online abuse and harassment. Probably the most famous example is Gamergate, which we're not going to get into here, because I'd rather eat glass than even do that shitstorm the dignity of a summary. Look it up in another tab if you really don't know.

The point is, there are a number of well-known cases where specific individuals have been targeted by huge crowds for harassment and abuse. But there are orders upon orders of magnitude more cases that have not become even remotely as well-known, but which nevertheless have caused very real harm in people's lives.

In 2014, the Pew research center conducted a study on harassment, with some striking findings. The worst forms of abusive harassment targeted women disproportionately more than men. This may not come as a surprise, but the sheer numbers involved are staggering: 26% of women aged 18-24 reported stalking, 25% reported sexual harassment, and 18% reported sustained harassment. The corresponding figures for men were 7%, 13%, and 16%, respectively.

The takeaway is this: If we sincerely care about fostering diversity in online communities -- and we all should -- then the first step is to recognize how abusive harassment disproportionately targets some demographics over others. Otherwise, it is impossible to put together a coherent picture of how these behaviors take place on whatever platform you might be dealing with.

It goes beyond harassment, in fact: A recent study suggests that women's contributions to open-source projects on Github tend to be accepted more often than men's -- unless the reviewer knows that the code was submitted by a woman, in which case the acceptance rate plummets. Why is the gender distribution of core developers for major open-source projects so lopsided? Gosh, I wonder.

Sharing Economy Apps and the New Bottle-Wavers

For better or for worse, the modern age has ushered in new 'disruptive' technologies the like of which we have never before seen. The classic example of this is what some people have taken to calling the sharing economy.

The sharing economy, in a nutshell, is based on the idea that while traditionally people have bought good or services from specialized third parties (taxi rides from taxi companies, hotel rooms from hotel companies), people totally would buy these things from each other if there existed a reliable channel to mediate those transactions. What's more, lots of people have services to offer, but no good way to offer them. If you're going out of town for a week, your apartment is just sitting there empty, and empty living space has an inherent value which you are not capitalizing on. Catchphrases like "unused value is wasted value" get thrown around a lot when describing this sort of situation.

Enter "sharing economy" apps. Uber, Lyft, et al., let you play taxi using your very own car. Airbnb lets you play hotel with your own property. The apps are a mediated channel for connecting consumers with providers, and (hopefully) giving each a reasonable level of assurance about the other. Basically, they give you a way to easily rent out things you already own, on your schedule. Stated in the abstract this way, it probably sounds nice. And a lot of the time, it is. But it also has its share of failures, and most people seem to turn a blind eye to them, drunk as we are on its successes.

Does UEFI Secure Boot Actually Help Security?

You know, BIOS gets a bad rap. Most people only know it by the splash screen they see when they first boot up, and if they ever have to actually interact with it, what they find is often downright jarring. Flat colors? Keyboard-only navigation? Didn't we leave all this behind decades ago?

Maybe we did in higher-level systems, but not here. And if we're being honest, I've always had a soft spot for those tacky, old-school ASCII menus. They're kind of cute. And UEFI, the successor to BIOS, is so user-friendly it creeps me out a little bit -- you can even use a mouse in it! What kind of low-level interface supports a mouse?

I do have to admit, though, that UEFI fixes some important problems. It can boot from multiple-terabyte hard drives, which apparently people need these days. It has networking capabilities that BIOS couldn't dream of. UEFI is more broadly portable across different processors, which helps with security and stability.

That's the good. There's also lots of bad. We could talk about UEFI's negligence towards long-standing device driver issues, but that's nothing next to Microsoft's darling, the UEFI "Secure Boot" feature. Secure Boot is borderline functional for Windows users and an unmitigated disaster for everyone else.

Politics in Software

This is the start of a two-month series of posts on the intersection of politics and technology. The series consists of two bookend posts, with a number of focused topic discussions in between; this is the first bookend post. Now that the series is concluded, this post has been lightly edited to add links to the later posts.

Near my family's house in Seattle are two major construction projects. The first is building a new, refurbished waste transfer station; the second, a new corporate headquarters. In spite of the differences in these buildings' purposes, I'm willing to bet that the labor crews for each have pretty similar feelings towards their work. What difference does it make, being a bricklayer for the state or a bricklayer for private industry? Perhaps not much. It's understandable how most people tend to view their work as apolitical.

And yet, in building something that other people are going to use, you are in some sense helping those people, and so perhaps we should give serious thought to who it is we help. In some domains it might not matter much -- certainly there's no shortage of people who can lay down bricks -- but in other domains, very real political shifts can take place without anyone caring or even noticing.

This probably sounds pretty abstract. The goal of the series I'm writing here is to bring this discussion down to earth. I'm going to try to illustrate, through concrete examples, the real and serious political consequences of the choices people make on what projects to support and what projects to ignore.

I'm focusing on software issues. There's a reason for this. A lot of people see software development as "digital bricklaying", and not without good reason: both have the potential to be menial, repetitive, borderline rote tasks with little reward aside from wages. It would be a mistake, though, to let this comparison lead us to assume that software is no more political than other menial crafts. As soon as we get into social issues, the comparison breaks down.

There can be deep political ramifications to software design decisions. Most people turn a blind eye here, or take only a superficial interest, caring about the politics just long enough to let someone convince them they're on the right side, then wandering off in a happy haze to implement some new half-baked idea. Half a year later, that idea is raining down all sorts of unintended consequences. This is the sort of thing we would call naïveté, if it were harmless. But when it impacts people's lives, we don't have the luxury of being so kind.

The frame, fully assembled, with trays

It's hard to play with a Raspberry Pi and not wonder what you could do with a bunch of them wired together. We're talking about tiny computers with 1GB RAM, a quad-core 900MHz processor, and a GPU, all on a credit card-sized board. They're not too far off, specs-wise, from the ThinkPad X60 I'm typing this post on. All that with a $40 price tag -- how could you not want as many as possible?

It probably wouldn't surprise anyone who knows me to learn that I've been working on doing exactly this: wiring a bunch of Raspberry Pis into a cluster for distributed computing. I've now collected enough of the hardware that the big priority is building a rack to keep it all organized. There are a bunch of cool Pi cluster rack designs out there. One 3D-printed design stands out in particular, both for the idea behind it and the implementation. I considered a few different designs, but none were a perfect match for my goals. In the end, it seemed like the best option was to roll my own rack.

Thoughts on some 2015 Putnam Exam Problems

Now that a few days have gone by, I thought I'd write a few notes on things I found interesting while working on this year's Putnam exam.

This year's problems seemed pretty different from last year's. That year, I had trouble making use of all my time; this year, I found myself working into the last five minutes in each session. The A section felt very approachable in particular; B was certainly more arcane, but still didn't feel prohibitively so. There were interesting problems to talk about on both sides.

A PDF with this year's problems can be found attached to the first post here: (link)

this is just to say

i have drunk
the protein shake
that was in
the fridge

and which
you were probably
saving
for breakfast

it was
so gross
what was
i thinking

What Does the Internet of Things Mean for Security?

We are standing at the edge of a steep hill full of very sharp rocks, and internet-connected hardware manufacturers are trying to push us over.

Imagine you woke up one day to find out that overnight you lost access to every account you use online -- Facebook, Twitter, Gmail, you name it. Worse, because all your password resets ran through your Gmail account, there's no easy way to get these accounts back. Now imagine that all of this happened because you bought the wrong fridge.

This sounds like an absurd thought experiment. It isn't. For those few who're unlucky enough to own the vulnerable model of 'smart' fridge, this could actually happen. This is reality. It's made by Samsung and costs about $3600. If an attacker can get within radio distance of the fridge, they could take over the fridge owner's Google account without breaking a sweat.

Once we've got that (imagining ourselves now in the attacker's shoes, taking advantage of some poor sap who dropped close to four grand on an absurd fridge), we can pull up their other accounts, hit that big fat "password reset" button, and get a link delivered to your new mailbox inviting you to set the account's password to whatever you like. Note that the strength of the original passwords has no bearing on whether this attack can work.

Diving Deeper Into Deep Dream: Different Distortions

The moment you get Google Research's Deep Dream project set up, you can make it do incredible things. If you want to generate trippy zooms with lots of dogs and eyes, the code Google provides works more or less out-of-the-box.

For reference, here is a gist for a .py file containing all the code from the notebook. This is meant to be run through an interactive interpreter, e.g. by providing the -i flag at runtime or, in iPython, starting a new notebook and then executing e.g. "run demo.py".

And, here is a gist for a similar but extended piece of code which takes an input file on the command line and performs the iterative process described at the very end of the researchers' original iPython notebook. Both these files are almost entirely composed of code from that notebook; they're just provided as references, so we can have some well-established starting points from which to branch out.

Prelude

Cards on the table. I'm not here to teach you about neural nets or about Deep Dream. I'm here to show you some really fucking cool graphics (watch that link to the end, I promise it's worth it), and I'm here to give you some ideas about how you could make your own. If you aren't interested in the blow-by-blow, and want to hurry up and get to the blow-my-mind, just pop on down to the bottom of the post, where you'll find a list of every graphic linked to in the body here, compiled for your browsing pleasure.

Setting Up Deep Dream, Google Research's Hallucinatory Work of Genius

Google Research wrote recently about a technique they call "Inceptionism", which needs to be seen to be believed. Click through and check out their pictures. A full gallery of official images can be found here. The techniques used to generate these sorts of images were described in broad strokes in this blog post, but the level of detail stopped just short of what someone wanting to replicate their results might have wanted.

That is, until they published their source code.

In this repo (which was created just this morning, as of this post's writing) is an IPython notebook which contains everything necessary to duplicate the results you see in their blog post. That's so cool that it's honestly a little hard to believe. The dependencies are detailed in the repo. There are surprisingly few of them, and they're all surprisingly easy to get set up. Let me walk you through what I had to do on my Debian Stretch system to get everything up and running.

Evaluating the Alternating Harmonic Series

Here's a fun little note I just found while organizing my desk. I came up with this proof during a class on infinite series last summer. The professor mentioned the limit of the alternating harmonic series, then commented offhand that "this limit is true, but we cannot prove it yet." Never one to shy from a challenge, I decided to see whether he was right, and had roughly this proof scrawled on a piece of paper by the end of class. He suggested that I try to publish it in an undergrad journal, but that didn't pan out for various reasons. Nevertheless, it's a fun proof, and I wasn't able to find it anywhere online, so I thought I'd share it here.

The alternating harmonic series is defined as

which, of course, is just the harmonic series with the sign of every other term flipped.

It is well known that the harmonic series diverges. Slightly less well known is that the alternating harmonic series converges to a very clean limit -- strangely enough, it turns out to be ln 2. This fact, if it comes up, is usually proved just as a trivial consequence of the Euler-Mascheroni constant's most common identity. Not only is that proof beyond the reach of many undergrads, it's also so trivial as to border on inane.

It's like using a sledgehammer to pound in a nail: of course it works, but the result would look a lot nicer if you used a hammer instead.

Finding Files with Multiple Substrings

I'd like to share with you a challenge I ran into last night, a problem that's really difficult unless you take to the command line.

Some background: I play Go. The GoGoD game database, which I recently bought and downloaded, has somewhere on the order of 80,000 Go game records in it. These games are all in .sgf format, which is a simple plaintext format. Records of games from professionals of many different ranks are included. The rating scale for professionals starts at 1 dan (or "shodan") and goes up to 9 dan, the highest rating a player can achieve. Games between professionals are always interesting, but games between 9dan professionals are, in particular, packed with subtlety and value for the ambitious student of Go.

Any game from this database is worth studying, but with over eighty thousand games to choose from, it helps to be able to pick highlights, and what better way than by focusing on games between 9dans?

If we decide to take this route, we're immediately faced with a second question: sure, we can just open games at random until we find one between 9dans, but this is boring and slow. Are we doomed to tedium, or is there some way we could search this database of more than eighty thousand games, pick out only the games between 9dans, and copy these over to a special "elite database"?

To the average user, this might sound impossible. On the command line, it's almost trivial.

In Defense of Honors Programs (or, Read a Fucking Book)

In planning the follow-up to a survival skills event for incoming Honors freshmen, the question came up of how best to articulate to new students the value of WWU's honors program. Especially for STEM students, the fact is that the liberal-arts-focused Honors courseload can be expected to overlap very little (if at all) with their major's courses. That's a non-negligible number of extra credits, and there's nothing scarier than a delay in graduation, and so lots of people in the program question, at one point or another, whether the program is worth seeing through to the end. It's our job to try and make the case that it is.

Of course, regardless of obligation, it'd be disingenuous of us to argue this point if we did not ourselves wholly believe it. For my part, I am lucky in this regard, because in spite of (because of?) being a Math/CS double major, I'm convinced that the honors classes I've taken have on average been far more valuable than anything I've taken in either of my majors. Ditto for the classes I'm planning on taking. That's not because I don't like my majors -- far from it, I'm crazy about them. But computer science is close being treated as a trade skill, and math's "beauty cold and austere, like that of sculpture" (Russell) nourishes only certain parts of the mind. Both majors are fantastic for what they are, but what they are is not a complete education. Similar things could be said for practically any other major: They leave gaps, and honors fills those gaps.

My first, informal attempt at articulating to my co-planners this argument for why to stick with honors went something along these lines:

Why stay with honors? Because we offer things you can't get anywhere else. We have classes you can take as a freshman that're on par with the upper-division classes in specialized majors. You can just be like, "I want to learn about Russian literature this quarter" or "I want to learn about film this quarter" and nobody's going to be like, what the fuck, you're a chem major. These kinds of classes are just there, and anyone in the program can take them. If you don't want to be an English major but you want to study great books, we're your best shot. If you want to be in rooms full of clever people talking about Nabokov or Robbe-Grillet or Bely or Dante or for that matter anyone else worth reading, we've got you covered. We've got some of the best professors in the university teaching our classes. Great classes, taught by great professors -- what more do you fucking want?

I stand fully behind that sentiment, in spite of having been explicitly forbidden from ever putting it in such terms at official honors events. One struggles to imagine why.

Perhaps it's because this claim, despite being (I maintain) valid in what it says, doesn't go far enough. Because it's not just that you can study these things if you want to. Yes, you can, and that's amazing. We offer stellar classes that your average student in any employable major would have no other equivalent for. It's hard to complain about that. But that's not everything. It gets more personal.

2015 iCTF Retrospective

After some delays, last Friday saw the 2015 UCSB iCTF come and go, and it was quite the experience. This is (to my knowledge) the first year that Western has joined the competition. We didn't exactly place first, but it was great fun and lots of lessons were learned. Now that it's over and we've had a few days to reflect, here are some thoughts on what we did right, what we could've done better, and what I didn't go in expecting.

There were two big surprises, for myself and (I think) the rest of the team: First, the sheer mind-boggling number of exposed vulnerable services -- upwards of 40, I'm pretty sure -- and second, the inability to directly attack other teams.

You know what?

It doesn't seem like there's any good way to handle armchair "philosophers" who think they have something deep to say about the idea of certainty. "See, the thing is, you can't actually know anything for sure! I mean, like, what if we're all just brains in vats?" The only compelling argument I see for this point proceeds by example: The claim is that nobody can truly know anything whatsoever. Anybody who can in all sincerity make this claim must know very little -- perhaps little enough to serve as supporting evidence.

Joking aside, this is one of the silliest ideas out there. Those who contract it generally do so within a year of their first serious exposure to the great philosophers, in that honeymoon phase where everything makes sense and the ideas of "discarding all assumptions" and "questioning everything" still seem novel. Socrates and Descartes are particularly severe offenders, as brief or unconsidered treatments of their works can easily give the impression that they advocated this total uncertainty. What, more precisely, is this misconstrued idea, though?

It's the idea that we just can't know anything for real, man. You know? Like, think about it -- everything we know comes from observing the world, via our senses, right, and how can we trust our senses? What if we all see, like, different colors? What if two plus two doesn't always equal four? What if it could equal five, but we don't believe that because we've never seen it? Or because we can't see it? Would we even know how to count if we were all blind? I, like... I dunno, dude. I dunno. Whoa. We don't even know that. We don't know anything. Not really.

Symbolic Computational Geometry in Python: Heron's Formula

Floating point numbers suck. Sometimes they're the best we've got, but they still suck. Floating point numbers are to math as Physics 101 lab measurements are to the laws of physics: you get basically what you expect, but with a margin of error tacked on. Sometimes that's okay. Sometimes it's not. But what choice do we have? I'm so glad you asked, because it turns out we have plenty.

But before we get into that, let's convince ourselves that we have a problem to solve. Luckily, this is really easy. Suppose we want to find out, for some terribly important reason, the area of the triangle with vertices at (3,4), (7,10), and (11, 161). We don't get pen and paper -- in fact, the only tool at our disposal is a Python shell. If we know some geometry, then our naive approach might go something like this:

Cheat Sheet: Windowing in Screen & Vim

As a lazy computer science student, I find myself working on programs over SSH a lot. This tends to pose some challenges. Unless you want to mess around with -X and -Y flags, all you get is one terminal to do all your work in. Having to work in this environment has taught me one lesson: When all you have is a terminal, everything looks like a window manager. Or something. Anyway. There are lots of ways to split up a terminal session, and this post covers some common commands involved.

Cracking General Substitution Ciphers

As some of you have requested, today we'll be talking about paper-and-pencil cryptography and how we can use computers to poke it full of holes!

We're going to start with some background on basic (pre-digital) cryptography. Then, we're going to describe the specific cipher we're attacking here, along with some standard paper-and-pencil approaches to cracking it. Thirdly, we'll discuss one algorithmic approach powered by a dictionary file. Then, we'll look at some Python scripts implementing the discussed ideas. Two different scripts will be provided: A bare-bones one written to be simple to understand, and a batteries-included fancy one which elegantly implements the main logic using coroutines.