What every computer science major should know

Here are my distillations of Matt's seminal blog post: What CS Majors Ought to Know.

I have removed the imperative matter, and left the declarative material; marking off that which I have completed.

The original post distils Matt's lifelong experience as a Computer Scientist and Professor to answer these 4 questions:

  • What should every student know to get a good job?
  • What should every student know to maintain lifelong employment?
  • What should every student know to enter graduate school?
  • What should every student know to benefit society?

Here are his prescriptions:

Portfolio versus resume

  • Every computer science major should build a portfolio.
  • Contributions to open source should be linked and documented.

Technical communication

Specific recommendations

  • Master a presentation tool.

Recommended reading

An engineering core

  • real analysis
  • linear algebra
  • probability
  • physics
  • physics through to electromagnetism
  • multivariable calculus
  • differential equations
  • statistics

Recommended reading

The Unix philosophy

  • comfortable with and practiced in the Unix philosophy of computing.
  • in practice, this means becoming comfortable with the notion of command-line computing, text-file configuration and IDE-less software development.

Specific recommendations

  • fluent in basic Unix, including the ability to:

    • navigate and manipulate the filesystem;
    • compose processes with pipes;
    • comfortably edit a file with emacs and vim
  • create, modify and execute a Makefile for a software project;
  • write simple shell scripts.

    • Find the five folders in a given directory consuming the most space.

      #!/usr/bin/env python3
      
      import os
      import sys
      
      def get_folder_size(start_path):
          """Recursively compute total size (in bytes) of all files under start_path."""
          total_size = 0
          for dirpath, dirnames, filenames in os.walk(start_path):
              for f in filenames:
                  fp = os.path.join(dirpath, f)
                  if not os.path.islink(fp):  # Avoid following symlinks
                      try:
                          total_size += os.path.getsize(fp)
                      except OSError:
                          pass
          return total_size
      
      def main():
          if len(sys.argv) < 2:
              print(f"Usage: {sys.argv[0]} <directory>")
              sys.exit(1)
      
          directory = sys.argv[1]
          if not os.path.isdir(directory):
              print(f"Error: {directory} is not a valid directory.")
              sys.exit(1)
      
          folder_sizes = []
          for item in os.listdir(directory):
              path = os.path.join(directory, item)
              if os.path.isdir(path):
                  size = get_folder_size(path)
                  folder_sizes.append((size, path))
      
          # Sort by size descending
          folder_sizes.sort(key=lambda x: x[0], reverse=True)
      
          # Print top five
          print(f"Top 5 largest folders in '{directory}':")
          for size, folder in folder_sizes[:5]:
              print(f"{folder}\t{size} bytes")
      
      if __name__ == "__main__":
          main()
    • Report duplicate MP3s (by file contents, not file name) on a computer.

      #!/usr/bin/env python3
      
      import os
      import sys
      import hashlib
      
      def hash_file(filepath, blocksize=65536):
          """Return SHA-256 hash of the given file."""
          hasher = hashlib.sha256()
          with open(filepath, "rb") as f:
              buf = f.read(blocksize)
              while buf:
                  hasher.update(buf)
                  buf = f.read(blocksize)
          return hasher.hexdigest()
      
      def main():
          if len(sys.argv) < 2:
              print(f"Usage: {sys.argv[0]} <directory>")
              sys.exit(1)
      
          root_dir = sys.argv[1]
          if not os.path.isdir(root_dir):
              print(f"Error: {root_dir} is not a valid directory.")
              sys.exit(1)
      
          file_hashes = {}
          for dirpath, dirnames, filenames in os.walk(root_dir):
              for fn in filenames:
                  if fn.lower().endswith(".mp3"):
                      fullpath = os.path.join(dirpath, fn)
                      try:
                          h = hash_file(fullpath)
                          file_hashes.setdefault(h, []).append(fullpath)
                      except Exception as e:
                          print(f"Could not process file {fullpath}: {e}")
      
          # Report duplicates
          found_duplicates = False
          for h, paths in file_hashes.items():
              if len(paths) > 1:
                  found_duplicates = True
                  print("Duplicate MP3s detected:")
                  for p in paths:
                      print(f"  {p}")
                  print()
      
          if not found_duplicates:
              print("No duplicate MP3 files found.")
      
      if __name__ == "__main__":
          main()
    • Take a list of names whose first and last names have been lower-cased, and properly recapitalize them.

      #!/usr/bin/env python3
      
      import sys
      
      def fix_name(line):
          """
          Takes a line like "john doe" (fully lower-cased).
          Returns "John Doe" (properly capitalized).
          Handles middle names if present.
          """
          parts = line.split()
          return " ".join(part.capitalize() for part in parts)
      
      def main():
          """
          Read names from stdin, print re-capitalized names to stdout.
          Usage example:
              cat names.txt | ./recap_names.py
          """
          for line in sys.stdin:
              line = line.strip()
              if line:
                  print(fix_name(line))
      
      if __name__ == "__main__":
          main()
    • Find all words in English that have x as their second letter, and n as their second-to-last.

      #!/usr/bin/env python3
      
      def main():
          """
          Searches in /usr/share/dict/words (common location on Linux).
          Prints words where:
            - the second letter is 'x'
            - the second-to-last letter is 'n'
          """
          dictionary_path = "/usr/share/dict/words"
          try:
              with open(dictionary_path, "r") as f:
                  for word in f:
                      w = word.strip()
                      if len(w) > 2:
                          # second letter = w[1], second-to-last = w[-2]
                          if w[1] == 'x' and w[-2] == 'n':
                              print(w)
          except FileNotFoundError:
              print(f"Dictionary not found at {dictionary_path}.")
              print("Please adjust the path or provide a suitable word list file.")
      
      if __name__ == "__main__":
          main()
    • Directly route your microphone input over the network to another computer's speaker.

      #!/usr/bin/env bash
      #
      # Usage:
      #   ./stream_mic.sh <destination_host> <destination_port>
      #
      # Example:
      #   ./stream_mic.sh 192.168.1.50 9999
      #
      # Requirements:
      #   - arecord (ALSA) or sox/other recorder
      #   - netcat (nc)
      
      DEST_HOST="$1"
      DEST_PORT="$2"
      
      if [[ -z "$DEST_HOST" || -z "$DEST_PORT" ]]; then
        echo "Usage: $0 <destination_host> <destination_port>"
        exit 1
      fi
      
      # Record from the microphone at CD-quality and send over netcat
      arecord -f cd | nc "$DEST_HOST" "$DEST_PORT"
    • Replace all spaces in a filename with underscore for a given directory.

      #!/usr/bin/env python3
      
      import os
      import sys
      
      def main():
          if len(sys.argv) < 2:
              print(f"Usage: {sys.argv[0]} <directory>")
              sys.exit(1)
      
          directory = sys.argv[1]
          if not os.path.isdir(directory):
              print(f"Error: {directory} is not a valid directory.")
              sys.exit(1)
      
          for filename in os.listdir(directory):
              if ' ' in filename:
                  new_filename = filename.replace(' ', '_')
                  old_path = os.path.join(directory, filename)
                  new_path = os.path.join(directory, new_filename)
                  try:
                      os.rename(old_path, new_path)
                      print(f"Renamed '{filename}' -> '{new_filename}'")
                  except Exception as e:
                      print(f"Failed to rename '{filename}': {e}")
      
      if __name__ == "__main__":
          main()
    • Report the last ten errant accesses to the web server coming from a specific IP address.

      #!/usr/bin/env python3
      
      import sys
      from collections import deque
      
      def main():
          """
          Usage:
              ./errant_accesses.py /path/to/access.log 1.2.3.4
          Assumes Apache-style logs with status code at index [8].
          'Errant' is defined here as 4xx or 5xx response codes.
          """
          if len(sys.argv) < 3:
              print(f"Usage: {sys.argv[0]} <logfile> <IP_address>")
              sys.exit(1)
      
          logfile = sys.argv[1]
          ip_addr = sys.argv[2]
      
          last_ten = deque(maxlen=10)
      
          try:
              with open(logfile, "r") as f:
                  for line in f:
                      if ip_addr in line:
                          parts = line.split()
                          # Typical Apache log format:
                          #   IP - - [date] "GET /endpoint" status_code ...
                          if len(parts) > 8:
                              status_code = parts[8]
                              if status_code.startswith('4') or status_code.startswith('5'):
                                  last_ten.append(line.rstrip("\n"))
          except FileNotFoundError:
              print(f"Log file not found: {logfile}")
              sys.exit(1)
          except PermissionError:
              print(f"Permission denied reading log file: {logfile}")
              sys.exit(1)
      
          print(f"Last 10 errant (4xx/5xx) accesses from IP {ip_addr} in {logfile}:")
          for entry in last_ten:
              print(entry)
      
      if __name__ == "__main__":
          main()

Systems administration

  • computer scientists must be able to competently and securely administer their own systems and networks.

    • c.f. raspberry pi projects, and this site along with abaj.bots and abaj.games all running out of the same vps

Specific recommendations

  • Install and administer a Linux distribution.
  • Configure and compile the Linux kernel.
  • Troubleshoot a connection with dig, ping and traceroute.
  • Compile and configure a web server like apache.

    • done multiple times. running a custom build of nginx for this site
  • Compile and configure a DNS daemon like bind.
  • Maintain a web site with a text editor.
  • Cut and crimp a network cable.

Recommended reading

by Nemeth, Synder, Hein and Whaley.

Programming languages

Specific languages

  • Assembly.

Discrete mathematics

  • solid grasp of formal logic and of proof.
  • proof by algebraic manipulation and by natural deduction
  • proof by induction
  • fluent in formal mathematical notation, and in reasoning rigorously about the basic discrete structures: sets, tuples, sequences, functions and power sets.

Specific recommendations

  • reason clearly about:

    • trees;
    • graphs;
    • formal languages; and
    • automata.
  • learn enough number theory to study and implement common cryptographic protocols.

Recommended reading

Data structures and algorithms

  • understand how to design algorithms (e.g., greedy, dynamic strategies)
  • and how to span the gap between an algorithm in the ideal and the nitty-gritty of its implementation.

Specific recommendations

  • hash tables;
  • linked lists;
  • trees;
  • binary search trees; and
  • directed and undirected graphs.
  • know both the imperative and functional versions of each algorithm.

Recommended reading

Theory

  • models of computation and computational complexity.

    • computation: should cover finite-state automata, regular languages (and regular expressions), pushdown automata, context-free languages, formal grammars, Turing machines, the lambda calculus, and undecidability.
    • difference between P, NP, NP-Hard and NP-Complete.
  • solve a few large problems in NP by reduction to SAT and the use of modern SAT solvers.

Recommended reading

Architecture

  • understand a computer from the transistors up.

    • transistors, gates, adders, muxes, flip flops, ALUs, control units, caches and RAM.
  • GPU model of high-performance computing

Specific recommendations

  • good understanding of caches, buses and hardware memory management is essential to achieving good performance on modern systems.
  • to get a good grasp of machine architecture, students should design and simulate a small CPU.

Recommended reading

Operating systems

  • be aware of how kernels handle system calls, paging, scheduling, context-switching, filesystems and internal resource

management.

A good understanding of operating systems is secondary only to an understanding of compilers and architecture for achieving performance.

Specific recommendations

  • get hands dirty on a real operating system. (With Linux and virtualization, this is easier than ever before.)
  • To get a better understanding of the kernel, students could:

    • print "hello world" during the boot process;
    • design their own scheduler;
    • modify the page-handling policy; and
    • create their own filesystem.

Recommended reading

Networking

  • firm understanding of the network stack and routing protocols within a network.
  • mechanics of building an efficient, reliable transmission protocol (like TCP) on top of an unreliable transmission protocol (like IP) should not be magic to a computer scientist.
  • must understand the trade-offs involved in protocol design–for example, when to choose TCP and when to choose UDP.
  • Programmers need to understand the larger social implications for congestion should they use UDP at large scales as well.

Specific recommendations

  • know the protocols for existing standards, such as:

    • 802.3 and 802.11;
    • IPv4 and IPv6; and
    • DNS, SMTP and HTTP.
  • understand exponential back off in packet collision resolution and the additive-increase multiplicative-decrease mechanism involved in congestion control.
  • implement the following:

    • an HTTP client and daemon;
    • a DNS resolver and server; and
    • a command-line SMTP mailer.
  • no student should ever pass an intro neworking class without sniffing their instructor's Google query off wireshark.
  • [ ]* implement a reliable transmission protocol from scratch atop IP

Recommended reading

Security

Specific recommendations

  • At a minimum, every computer scientist needs to

understand:

  • social engineering;
  • buffer overflows;
  • integer overflow;
  • code injection vulnerabilities;
  • race conditions; and
  • privilege confusion.
  • how to properly configure a firewall with iptables.

Recommended reading

Cryptography

  • understand and be able to implement the following concepts, as well as the common pitfalls in doing so:

    • symmetric-key cryptosystems;
    • public-key cryptosystems;
    • secure hash functions;
    • challenge-response authentication;
    • digital signature algorithms; and
    • threshold cryptosystems.
  • every computer scientist should know how to acquire a sufficiently random number for the task at hand.
  • computer scientists need to know how to salt and hash passwords for storage.

Specific recommendations

  • RSA is easy enough to implement that everyone should do it.
  • Every student should create their own digital certificate and set up https in apache. (It's surprisingly arduous to do this.)
  • Student should also write a console web client that connects over SSL.
  • computer scientists should know how to use GPG;

    • how to use public-key authentication for ssh;
    • and how to encrypt a directory or a hard disk.
Recommended reading

Software testing

Software testing must be distributed throughout the entire curriculum.

User experience design

Programmers too often write software for other programmers, or worse, for themselves.

Visualization

Recommended reading

Parallelism

  • deep knowledge of architecture: multicore, caches, buses, GPUs, etc.
  • and, practice. Lots of practice.

Specific recommendations

  • learn CUDA
  • threads
  • pthreads (the library)

Software engineering

  • a good, hands-on course in the practice of team software construction provides a working knowledge of the pitfalls inherent in the endeavor.

Specific recommendations

  • centralized version control systems
  • working knowlege of debugging tools like gdb and valgrind

Recommended reading

Formal methods

  • learn to use a theorem prover (it immediately impacts coding style.)

    • incomplete switch statements and correctness of recursive functions
  • Software Foundations.

Graphics and simulation

There is no discipline more dominated by "clever" than graphics.

Robotics

Artificial intelligence

Recommended reading

Machine learning

  • Bayesian networks, clustering and decision-tree learning.

Recommended reading

Databases

  • Relational algebra and relational calculus
  • ER modeling

Specific recommendations

  • set up and operate a LAMP stack

Recommended reading

Non-specific reading recommendations