What every computer science major should know
Here are my distillations of Matt's seminal blog post: What CS Majors Ought to Know.
I have removed the imperative matter, and left the declarative material; marking off that which I have completed.
The original post distils Matt's lifelong experience as a Computer Scientist and Professor to answer these 4 questions:
- What should every student know to get a good job?
- What should every student know to maintain lifelong employment?
- What should every student know to enter graduate school?
- What should every student know to benefit society?
Here are his prescriptions:
Portfolio versus resume
- Every computer science major should build a portfolio.
- Contributions to open source should be linked and documented.
Technical communication
Specific recommendations
- Master a presentation tool.
Recommended reading
- Writing for Computer Science by Zobel.
- Even a Geek Can Speak by Asher.
- The LaTeX Companion.
- The TeXbook by Knuth. (Warning: Experts only.)
- Notes on Mathematical Writing
- Simon Peyton-Jones's advice on How to Give a Good Research Talk
- My advice on how to send and reply to email.
An engineering core
- real analysis
- linear algebra
- probability
- physics
- physics through to electromagnetism
- multivariable calculus
- differential equations
- statistics
Recommended reading
- Calculus by Spivak.
- All of Statistics: A Concise Course in Statistical Inference by Wasserman.
The Unix philosophy
- comfortable with and practiced in the Unix philosophy of computing.
- in practice, this means becoming comfortable with the notion of command-line computing, text-file configuration and IDE-less software development.
Specific recommendations
-
fluent in basic Unix, including the ability to:
- navigate and manipulate the filesystem;
- compose processes with pipes;
- comfortably edit a file with emacs and vim
- create, modify and execute a Makefile for a software project;
-
write simple shell scripts.
-
Find the five folders in a given directory consuming the most space.
#!/usr/bin/env python3 import os import sys def get_folder_size(start_path): """Recursively compute total size (in bytes) of all files under start_path.""" total_size = 0 for dirpath, dirnames, filenames in os.walk(start_path): for f in filenames: fp = os.path.join(dirpath, f) if not os.path.islink(fp): # Avoid following symlinks try: total_size += os.path.getsize(fp) except OSError: pass return total_size def main(): if len(sys.argv) < 2: print(f"Usage: {sys.argv[0]} <directory>") sys.exit(1) directory = sys.argv[1] if not os.path.isdir(directory): print(f"Error: {directory} is not a valid directory.") sys.exit(1) folder_sizes = [] for item in os.listdir(directory): path = os.path.join(directory, item) if os.path.isdir(path): size = get_folder_size(path) folder_sizes.append((size, path)) # Sort by size descending folder_sizes.sort(key=lambda x: x[0], reverse=True) # Print top five print(f"Top 5 largest folders in '{directory}':") for size, folder in folder_sizes[:5]: print(f"{folder}\t{size} bytes") if __name__ == "__main__": main()
-
Report duplicate MP3s (by file contents, not file name) on a computer.
#!/usr/bin/env python3 import os import sys import hashlib def hash_file(filepath, blocksize=65536): """Return SHA-256 hash of the given file.""" hasher = hashlib.sha256() with open(filepath, "rb") as f: buf = f.read(blocksize) while buf: hasher.update(buf) buf = f.read(blocksize) return hasher.hexdigest() def main(): if len(sys.argv) < 2: print(f"Usage: {sys.argv[0]} <directory>") sys.exit(1) root_dir = sys.argv[1] if not os.path.isdir(root_dir): print(f"Error: {root_dir} is not a valid directory.") sys.exit(1) file_hashes = {} for dirpath, dirnames, filenames in os.walk(root_dir): for fn in filenames: if fn.lower().endswith(".mp3"): fullpath = os.path.join(dirpath, fn) try: h = hash_file(fullpath) file_hashes.setdefault(h, []).append(fullpath) except Exception as e: print(f"Could not process file {fullpath}: {e}") # Report duplicates found_duplicates = False for h, paths in file_hashes.items(): if len(paths) > 1: found_duplicates = True print("Duplicate MP3s detected:") for p in paths: print(f" {p}") print() if not found_duplicates: print("No duplicate MP3 files found.") if __name__ == "__main__": main()
-
Take a list of names whose first and last names have been lower-cased, and properly recapitalize them.
#!/usr/bin/env python3 import sys def fix_name(line): """ Takes a line like "john doe" (fully lower-cased). Returns "John Doe" (properly capitalized). Handles middle names if present. """ parts = line.split() return " ".join(part.capitalize() for part in parts) def main(): """ Read names from stdin, print re-capitalized names to stdout. Usage example: cat names.txt | ./recap_names.py """ for line in sys.stdin: line = line.strip() if line: print(fix_name(line)) if __name__ == "__main__": main()
-
Find all words in English that have x as their second letter, and n as their second-to-last.
#!/usr/bin/env python3 def main(): """ Searches in /usr/share/dict/words (common location on Linux). Prints words where: - the second letter is 'x' - the second-to-last letter is 'n' """ dictionary_path = "/usr/share/dict/words" try: with open(dictionary_path, "r") as f: for word in f: w = word.strip() if len(w) > 2: # second letter = w[1], second-to-last = w[-2] if w[1] == 'x' and w[-2] == 'n': print(w) except FileNotFoundError: print(f"Dictionary not found at {dictionary_path}.") print("Please adjust the path or provide a suitable word list file.") if __name__ == "__main__": main()
-
Directly route your microphone input over the network to another computer's speaker.
#!/usr/bin/env bash # # Usage: # ./stream_mic.sh <destination_host> <destination_port> # # Example: # ./stream_mic.sh 192.168.1.50 9999 # # Requirements: # - arecord (ALSA) or sox/other recorder # - netcat (nc) DEST_HOST="$1" DEST_PORT="$2" if [[ -z "$DEST_HOST" || -z "$DEST_PORT" ]]; then echo "Usage: $0 <destination_host> <destination_port>" exit 1 fi # Record from the microphone at CD-quality and send over netcat arecord -f cd | nc "$DEST_HOST" "$DEST_PORT"
-
Replace all spaces in a filename with underscore for a given directory.
#!/usr/bin/env python3 import os import sys def main(): if len(sys.argv) < 2: print(f"Usage: {sys.argv[0]} <directory>") sys.exit(1) directory = sys.argv[1] if not os.path.isdir(directory): print(f"Error: {directory} is not a valid directory.") sys.exit(1) for filename in os.listdir(directory): if ' ' in filename: new_filename = filename.replace(' ', '_') old_path = os.path.join(directory, filename) new_path = os.path.join(directory, new_filename) try: os.rename(old_path, new_path) print(f"Renamed '{filename}' -> '{new_filename}'") except Exception as e: print(f"Failed to rename '{filename}': {e}") if __name__ == "__main__": main()
-
Report the last ten errant accesses to the web server coming from a specific IP address.
#!/usr/bin/env python3 import sys from collections import deque def main(): """ Usage: ./errant_accesses.py /path/to/access.log 1.2.3.4 Assumes Apache-style logs with status code at index [8]. 'Errant' is defined here as 4xx or 5xx response codes. """ if len(sys.argv) < 3: print(f"Usage: {sys.argv[0]} <logfile> <IP_address>") sys.exit(1) logfile = sys.argv[1] ip_addr = sys.argv[2] last_ten = deque(maxlen=10) try: with open(logfile, "r") as f: for line in f: if ip_addr in line: parts = line.split() # Typical Apache log format: # IP - - [date] "GET /endpoint" status_code ... if len(parts) > 8: status_code = parts[8] if status_code.startswith('4') or status_code.startswith('5'): last_ten.append(line.rstrip("\n")) except FileNotFoundError: print(f"Log file not found: {logfile}") sys.exit(1) except PermissionError: print(f"Permission denied reading log file: {logfile}") sys.exit(1) print(f"Last 10 errant (4xx/5xx) accesses from IP {ip_addr} in {logfile}:") for entry in last_ten: print(entry) if __name__ == "__main__": main()
-
Recommended reading
- The Unix Programming Environment by Kernighan and Pike.
- The Linux Programming Interface: A Linux and UNIX System Programming Handbook by Kerrisk.
- Unix Power Tools by Powers, Peek, O'Reilly and Loukides.
- commandlinefu.
- Linux Server Hacks.
- The single Unix specification.
Systems administration
-
computer scientists must be able to competently and securely administer their own systems and networks.
- c.f. raspberry pi projects, and this site along with abaj.bots and abaj.games all running out of the same vps
Specific recommendations
- Install and administer a Linux distribution.
- Configure and compile the Linux kernel.
- Troubleshoot a connection with
dig
,ping
andtraceroute
. -
Compile and configure a web server like apache.
- done multiple times. running a custom build of nginx for this site
- Compile and configure a DNS daemon like bind.
- Maintain a web site with a text editor.
- Cut and crimp a network cable.
Recommended reading
by Nemeth, Synder, Hein and Whaley.
Programming languages
- ideally, every computer science major would take a compilers class.
-
At a minimum, every computer science major should implement an interpreter.
Specific languages
-
C;
- ANSI C by Kernighan and Ritchie.
-
JavaScript;
- JavaScript: The Definitive Guide by Flanagan.
- JavaScript: The Good Parts by Crockford.
- Effective JavaScript: 68 Specific Ways to Harness the Power of JavaScript by Herman.
-
Java;
- Effective Java by Bloch.
-
Haskell;
- Learn You a Haskell by Lipovaca.
- Real World Haskell by O'Sullivan, Goerzen and Stewart.
-
C++; and
- The C++ Programming Language by Stroustrup.
- C++ Templates: The Complete Guide by Vandevoorde and Josuttis.
- Programming Pearls by Bentley.
-
Assembly.
- generative programming (macros);
- lexical (and dynamic) scope;
- closures;
- continuations;
- higher-order functions;
- dynamic dispatch;
- subtyping;
- modules and functors;
- monads as semantic concepts distinct from any specific syntax.
- Structure and Interpretation of Computer Programs by Abelson, Sussman and Sussman.
- Lisp in Small Pieces by Queinnec.
Discrete mathematics
- solid grasp of formal logic and of proof.
- proof by algebraic manipulation and by natural deduction
- proof by induction
- fluent in formal mathematical notation, and in reasoning rigorously about the basic discrete structures: sets, tuples, sequences, functions and power sets.
Specific recommendations
-
reason clearly about:
- trees;
- graphs;
- formal languages; and
- automata.
- learn enough number theory to study and implement common cryptographic protocols.
Recommended reading
- How to Prove It: A Structured Approach by Velleman.
- How To Solve It by Polya.
Data structures and algorithms
- understand how to design algorithms (e.g., greedy, dynamic strategies)
- and how to span the gap between an algorithm in the ideal and the nitty-gritty of its implementation.
Specific recommendations
- hash tables;
- linked lists;
- trees;
- binary search trees; and
- directed and undirected graphs.
- know both the imperative and functional versions of each algorithm.
Recommended reading
- CLRS.
- Any of the Art of Computer Programming series by Knuth.
Theory
-
models of computation and computational complexity.
- computation: should cover finite-state automata, regular languages (and regular expressions), pushdown automata, context-free languages, formal grammars, Turing machines, the lambda calculus, and undecidability.
- difference between P, NP, NP-Hard and NP-Complete.
- solve a few large problems in NP by reduction to SAT and the use of modern SAT solvers.
Recommended reading
- Introduction to the Theory of Computation by Sipser.
- Computational Complexity by Papadimitriou.
- Algorithms by Sedgewick and Wayne.
- Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein.
Architecture
-
understand a computer from the transistors up.
- transistors, gates, adders, muxes, flip flops, ALUs, control units, caches and RAM.
- GPU model of high-performance computing
Specific recommendations
- good understanding of caches, buses and hardware memory management is essential to achieving good performance on modern systems.
- to get a good grasp of machine architecture, students should design and simulate a small CPU.
Recommended reading
- nand2tetris, which constructs a computer from the ground up.
- Computer Organization and Design by Patterson and Hennessy.
- "What every programmer should know about memory" by Drepper.
Operating systems
- be aware of how kernels handle system calls, paging, scheduling, context-switching, filesystems and internal resource
management.
A good understanding of operating systems is secondary only to an understanding of compilers and architecture for achieving performance.
Specific recommendations
- get hands dirty on a real operating system. (With Linux and virtualization, this is easier than ever before.)
-
To get a better understanding of the kernel, students could:
- print "hello world" during the boot process;
- design their own scheduler;
- modify the page-handling policy; and
- create their own filesystem.
Recommended reading
- Linux Kernel Development by Love.
Networking
- firm understanding of the network stack and routing protocols within a network.
- mechanics of building an efficient, reliable transmission protocol (like TCP) on top of an unreliable transmission protocol (like IP) should not be magic to a computer scientist.
- must understand the trade-offs involved in protocol design–for example, when to choose TCP and when to choose UDP.
- Programmers need to understand the larger social implications for congestion should they use UDP at large scales as well.
Specific recommendations
-
know the protocols for existing standards, such as:
- 802.3 and 802.11;
- IPv4 and IPv6; and
- DNS, SMTP and HTTP.
- understand exponential back off in packet collision resolution and the additive-increase multiplicative-decrease mechanism involved in congestion control.
-
implement the following:
- an HTTP client and daemon;
- a DNS resolver and server; and
- a command-line SMTP mailer.
- no student should ever pass an intro neworking class without sniffing their instructor's Google query off wireshark.
- [ ]* implement a reliable transmission protocol from scratch atop IP
Recommended reading
- Unix Network Programming by Stevens, Fenner and Rudoff.
Security
Specific recommendations
- At a minimum, every computer scientist needs to
understand:
- social engineering;
- buffer overflows;
- integer overflow;
- code injection vulnerabilities;
- race conditions; and
- privilege confusion.
- how to properly configure a firewall with iptables.
Recommended reading
- Metasploit: The Penetration Tester's Guide by Kennedy, O'Gorman, Kearns and Aharoni.
- Security Engineering by Anderson.
Cryptography
-
understand and be able to implement the following concepts, as well as the common pitfalls in doing so:
- symmetric-key cryptosystems;
- public-key cryptosystems;
- secure hash functions;
- challenge-response authentication;
- digital signature algorithms; and
- threshold cryptosystems.
- every computer scientist should know how to acquire a sufficiently random number for the task at hand.
- computer scientists need to know how to salt and hash passwords for storage.
Specific recommendations
- RSA is easy enough to implement that everyone should do it.
- Every student should create their own digital certificate and set up https in apache. (It's surprisingly arduous to do this.)
- Student should also write a console web client that connects over SSL.
-
computer scientists should know how to use GPG;
- how to use public-key authentication for ssh;
- and how to encrypt a directory or a hard disk.
Recommended reading
- Cryptography Engineering by Ferguson, Schneier and Kohno.
Software testing
Software testing must be distributed throughout the entire curriculum.
User experience design
Programmers too often write software for other programmers, or worse, for themselves.
Recommended reading
- Paul Graham's essay on Web 2.0.
- "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" by Spolsky.
- HTML and CSS: Design and Build Websites by Duckett.
- JavaScript: The Definitive Guide by Flanagan.
Visualization
Recommended reading
Parallelism
- deep knowledge of architecture: multicore, caches, buses, GPUs, etc.
- and, practice. Lots of practice.
Specific recommendations
- learn CUDA
- threads
- pthreads (the library)
Software engineering
- a good, hands-on course in the practice of team software construction provides a working knowledge of the pitfalls inherent in the endeavor.
Specific recommendations
- centralized version control systems
- working knowlege of debugging tools like gdb and valgrind
Recommended reading
- Version Control by Example by Sink.
Formal methods
-
learn to use a theorem prover (it immediately impacts coding style.)
- incomplete
switch
statements and correctness of recursive functions
- incomplete
- Software Foundations.
Graphics and simulation
There is no discipline more dominated by "clever" than graphics.
Recommended reading
Robotics
Related posts
Artificial intelligence
Recommended reading
- Artificial Intelligence by Russell and Norvig.
Machine learning
- Bayesian networks, clustering and decision-tree learning.
Recommended reading
- Machine Learning by Mitchell.
Databases
- Relational algebra and relational calculus
- ER modeling
Specific recommendations
- set up and operate a LAMP stack
Recommended reading
- SQL and Relational Theory by Date.
Non-specific reading recommendations
- Gödel, Escher, Bach by Hofstadter.
- Nick Black's advice for MS students.