Software at Scale 3 - Bharat Mediratta: ex-CTO, Dropbox

Published: Dec. 19, 2020, 6:32 a.m.

Bharat Mediratta was a Distinguished Engineer at Google, CTO at AltSchool, and CTO at Dropbox. At Google, he worked on GWS (Google Web Server), a system that I’ve always been curious about, especially since its Wikipedia entry calls it “one of the most guarded components of Google's infrastructure”.

In this podcast, we discuss GWS, bootstrapping a culture of testing at Google, breaking up services to be more manageable, monorepos, build systems, the ethics of software at scale, and more. We spent almost an hour and a half, and didn’t even manage to cover his experiences at AltSchool or Dropbox (which hopefully will be covered in a follow-up).

Listen on Apple Podcasts or Spotify.

Highlights

Notes are italicized.

0:20 - Background - Childhood interests in technology. His dad was a director at ADE in India, and recruited APJ Abdul Kalam, arguably one of India’s most popular Presidents, kick-starting Kalam’s career.

6:10 - Studying tech in university. Guru Meditation errors.

10:50 - Working at Sun Microsystems as a first job.

12:30 - Transitioning from being a programmer to a leader, and thinking about project plans and deadlines.

14:15 - Working on side projects for the company (a potential inspiration for 20% projects at Google?)

15:30 - Moving from Sun to a few startups to Google. How did 20% projects start?

16:50 - Google News as a 20% project. Apparently the 20% project has its own Wikipedia page.

18:24 - Did 20% time require management approval?

19:30 - TK at Google Cloud, and how the management model compares to early Google

21:00 - Declining an offer from Google in 2002, and going to VA Linux instead.

22:28 - Growth at Google from 2004 onwards.

24:28 - Hiring at Google at that time. “A players hire A players, B players hire C players”.

24:55 - Culture Fit (indoctrination)? Two weeks of “fairly intense education”, a Noogler project, and a general investment of time and money to help explain the Google way of doing things. It wasn’t accidental. I went through this in 2016 and definitely learnt a bunch, especially from an intriguing talk called “Life of a Query”.

27:22 - Culturally integrating acquisitions successfully. YouTube as an example.

28:40 - Differences between Google and YouTube, and other acquisitions like Motorola Mobility.

30:20 - Search/Google Web Server (GWS) only had 3 nines of availability? The difference between a forager and a refiner (in terms of programming)

31:15 - What was GWS? The server responsible for the Google homepage and Google Search.

32:20 - There was only one infrastructure engineer on GWS at the time (who wanted to switch), but about a hundred engineers made changes to it every week.

33:10 - Starting with writing unit tests for this system.

33:40 - “They” used to call GWS “the neck of Google”. Extremely critical, but also extremely fragile. Search results and 98% of revenue came through this system. One second of downtime implied revenue loss. Rewriting was infeasible.

34:50 - How to use unit tests to create a culture of shared understanding. Bharat released a manifesto that basically said “all changes to GWS required unit tests”. This caused massive consternation at the time.

36:10 - A quick example of how to enforce unit tests on new code. If an engineer didn’t add a new unit test, Bharat would write the test for them, and it would often fail due to a bug in the engineer’s code. This led to a culture where engineers realized the value of writing these tests (and implicitly bought into the practice).
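
As a small, hedged illustration of what “every change ships with a test” can look like (this is not GWS code; the function and its logic are invented for the example), a C++ unit test with GoogleTest might be as simple as:

```cpp
#include <string>

#include <gtest/gtest.h>

// Hypothetical query classifier, invented for this sketch: returns true if the
// raw query looks like a UPS tracking number (commonly "1Z" plus 16 characters).
bool LooksLikeUpsTrackingNumber(const std::string& query) {
  return query.size() == 18 && query.compare(0, 2, "1Z") == 0;
}

// A change touching query classification would be expected to land with tests
// like these, so a later refactor that breaks the behavior fails loudly.
TEST(QueryClassifierTest, RecognizesUpsTrackingNumber) {
  EXPECT_TRUE(LooksLikeUpsTrackingNumber("1Z999AA10123456784"));
}

TEST(QueryClassifierTest, RejectsOrdinaryQuery) {
  EXPECT_FALSE(LooksLikeUpsTrackingNumber("weather in palo alto"));
}
```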

39:23 - New Googlers were taught to write unit tests, so that new engineers would spread a culture of writing tests. “Oh, everyone writes unit tests at Google”.

41:50 - “What kind of features were those hundreds of engineers adding to GWS?” An example - searching for a UPS tracking number automatically showed you UPS tracking results. These were all quiet launches.

Some of the software design around experimentation on Google Search might have influenced Optimizely’s design.
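
For a rough sense of how an experimentation framework like that can work (a generic sketch, not Google’s or Optimizely’s actual implementation; all names here are made up), a common approach is to hash a stable user or session ID into a bucket and map bucket ranges to experiment arms:

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Deterministically assign a user to one of `num_buckets` buckets for a given
// experiment. Hashing the experiment name together with the user ID keeps
// assignments stable per user but independent across experiments.
uint32_t ExperimentBucket(const std::string& experiment,
                          const std::string& user_id,
                          uint32_t num_buckets) {
  const size_t h = std::hash<std::string>{}(experiment + ":" + user_id);
  return static_cast<uint32_t>(h % num_buckets);
}

// Example: route 1% of users (10 of 1000 buckets) to an alternate link color.
bool InAlternateLinkColorArm(const std::string& user_id) {
  return ExperimentBucket("link_color_shade_17", user_id, 1000) < 10;
}
```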

45:00 - Google’s search page in 2007 was pure HTML. In 2009, it was completely AJAX-based. This was a massive shift that happened transparently for users.

46:00 - “We wanted Search to be a utility. We wanted it to be the air you breathe. You don’t turn on the faucet and worry that water doesn’t come out.”

47:40 - The evolution of GWS’s architecture. Initially, it was very monolithic: GWS would talk to the indices, get results, rank them, and send back HTML. This was eventually broken into layers. Each layer had a clear responsibility, and the plan was to stick to that.
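
A rough sketch of what that layering implies (the interfaces and names below are illustrative, not GWS’s actual structure): each layer owns one responsibility and only talks to the layer beneath it.

```cpp
#include <string>
#include <vector>

// Illustrative layer boundaries, invented for this sketch.
struct Result {
  std::string url;
  std::string title;
  double score = 0.0;
};

// Fetches candidate results from the indices.
class IndexClient {
 public:
  virtual ~IndexClient() = default;
  virtual std::vector<Result> Fetch(const std::string& query) = 0;
};

// Orders candidates; knows nothing about HTML.
class Ranker {
 public:
  virtual ~Ranker() = default;
  virtual void Rank(std::vector<Result>& results) = 0;
};

// Turns ranked results into markup; knows nothing about the indices.
class Renderer {
 public:
  virtual ~Renderer() = default;
  virtual std::string RenderHtml(const std::vector<Result>& results) = 0;
};
```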

49:00 - “You could find one line of code in GWS that had C code, C++ code, HTML, JavaScript, and CSS output”. Wow.
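
To make that concrete, here is an invented example of the style being described (not actual GWS code): a single C-style statement inside a C++ function whose output interleaves HTML structure, inline CSS, and a JavaScript handler.

```cpp
#include <cstdio>

// One statement, five concerns: C formatting, C++ function, HTML, CSS, and JS.
void EmitResultLink(const char* url, const char* title) {
  std::printf(
      "<a href=\"%s\" style=\"color:#1a0dab;text-decoration:none\" "
      "onclick=\"return trackClick('%s')\">%s</a>\n",
      url, url, title);
}
```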

The number one query at Google at the time was “Yahoo” - a navigational search query.

50:00 - Google Instant was rolled out in 2010. Internally, this was called “Google Psychic”, because it was pretty good at predicting what users wanted to search for.

51:50 - “A rewrite would have been a disaster”. GWS was essentially refactored from the inside out every 18 months for 11 years. The first of these broke ranking out of GWS into a separate service.

57:00 - YouTube knew that if it convinced enough people to get better internet, Google would make more revenue.

59:00 - Search grew from 500-1000 people in 2004, to 3000 people in 2010.

59:30 - How exactly did search ranking work, technically and organizationally? The Long Click.

61:40 - Google ran 20+ experiments to figure out the best shade of blue on the Search page. This might seem silly, but it matters at scale, since it could potentially find the shade that helps the most colorblind individuals.

67:50 - Hate speech in Google Search results, and the ethical quandaries of building a humanity-scale system.

70:30 - Improving iteration speed and developer productivity for these systems

71:50 - Google had an ML model for search results back in 2004 that was competitive with the hand-built systems, but didn’t end up using it due to the lack of understandability. This has definitely changed now. I had read that document during my internship, but was surprised to learn that Google had had a working ML model for ranking as early as 2004.

73:30 - Service-Oriented Architecture at Google. It enabled GWS to move from C to C++ and divest itself of some responsibilities. But Google stuck with a monorepo, unlike Amazon.

76:40 - Components in the monorepo, plus Blaze (Bazel), helped Google keep build times scalable and improve iteration speed. Components is the most interesting piece, since to my understanding it hasn’t been written about much externally.

78:00 - The scale and complexity of the monorepo.

79:40 - The 400,000-line Makefile, and the start of Blaze.

82:00 - What were the benefits of “Components”?

84:00 - The project to multi-thread GWS, when it was serving 5-10 billion search queries a day. It started off as a practical joke.
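
As a rough illustration of the shape of that change (a generic sketch, not GWS’s actual design), multi-threading a request-serving loop usually means moving from one thread doing everything to an accept loop that hands requests to a fixed pool of workers:

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal fixed-size worker pool: the accept loop calls Submit() with a
// request handler, and the worker threads drain the queue concurrently.
class WorkerPool {
 public:
  explicit WorkerPool(size_t num_threads) {
    for (size_t i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] {
        while (true) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) return;
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();  // handle one request outside the lock
        }
      });
    }
  }

  ~WorkerPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& worker : workers_) worker.join();
  }

  void Submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool stop_ = false;
};
```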

91:00 - It\u2019s rarely only about the technology. It\u2019s about culture and team cohesion.



This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev