Build Your Own Language

How to build your own object-oriented language from scratch with PyPy

Outline

  1. Background
  2. Environment
  3. Ground rules
    1. Formatting & Naming Conventions
    2. Tests
  4. Deciding on the object model

Background

This page describes how you can build your own language from scratch, based on personal experience. It describes the steps we go through as we continue working on our own language, which is meant to be a flexible prototype-based language with a fairly unified syntax. We have not settled on the syntax yet because, as you will see, syntax is irrelevant during the first steps.

Environment (or why to use PyPy)

This tutorial describes how to write the language using PyPy. This means that the language interpreter / virtual machine is written entirely in RPython (Restricted Python). Even though PyPy is not yet production-ready, this approach has several advantages, and because of them we are fairly sure that PyPy is the way virtual machines and interpreters will be built in the future.

PyPy is a toolchain which allows you to write interpreters in a fairly high-level language, RPython. RPython is a restricted subset of Python, which means that the interpreter is fully testable by running it on top of an ordinary Python interpreter. This does not imply that the language interpreter will always be slow, since PyPy can translate the RPython code down to other targets such as C, the JVM, the CLI, ...

Does this mean that the PyPy people have invented a way of statically compiling Python code? No, and this is where the R in the name comes from. PyPy has to be able to analyze the code given to it fully and infer all types (as Python is dynamically typed). This places quite some restrictions on the use of Python. As it turns out, however, these restrictions matter little for building virtual machines, which are fairly strict and typeable by nature.

So the key point to take away: PyPy's translation toolchain is built only for building virtual machines, not for other kinds of dynamic Python programs. Python programs are by default probably not RPython (even though it might be possible to translate other programs that happen to be restricted enough).
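
To make the kind of restriction concrete, here is a small sketch in plain Python. It illustrates one commonly cited RPython constraint, namely that a variable should keep a single inferable type. Both function names are made up for this illustration, and the exact rules are defined by the PyPy toolchain, not by this sketch.

```python
def sum_of_squares(limit):
    # 'total' and 'number' are integers on every path, so a type
    # inferencer can give each variable exactly one type.
    total = 0
    for number in range(limit):
        total += number * number
    return total


def describe(value):
    # Valid Python, but 'result' is an int on one path and a str on
    # the other. Code like this is what type inference would be
    # expected to reject.
    if value > 0:
        result = value
    else:
        result = "not positive"
    return result
```

Both functions run fine under an ordinary Python interpreter; only the first one has the shape that a translation step relying on full type inference can be expected to accept.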

Besides giving us a very portable virtual machine which will run on the JVM, the CLI, C, (JavaScript), ..., PyPy also automatically provides other important but "annoying" tools (interpreter-independent, if you wish) that all virtual machines and interpreters want. These include a garbage collector and a JIT compiler. Since we get those for free, we can really focus on the language we want to design. We do not even need to find such tools and make sure they work with our interpreter: they are woven in transparently while the code is translated to a lower level. The same holds while the interpreter is itself interpreted by Python, since Python has a garbage collector as well.

So briefly, the advantages of using PyPy:

  • Quickly prototype a VM by writing it in a higher level language
  • Portability of the VM by translation
  • Performance of the VM by translation (and JIT compiler)
  • Typical VM-tools for free
    • Garbage collector
    • JIT-compiler

Ground Rules (or the first principles of textual programming)

When starting to write a new language, as with any project, it is very important to establish a set of ground rules which you follow throughout the process. In the beginning they might annoy you, but they are vital, especially once you go deeper into the development and if several people work on the project. Even though you might think now that the project will stay small and that you will hack on it alone, I can assure you that, for your own mental health, it is important to stick to the rules. We did not invent these rules ourselves, even though we rediscovered the need for them ourselves; they have a tradition coming from experience in building PyPy itself and languages on top of PyPy, and from long before that.

For now we will focus on two basic ground-rule topics: formatting & naming conventions, and tests.

Formatting & Naming Conventions

Before starting to do anything, you need to decide on the style of formatting and the naming conventions you will use. This is very important for consistency in appearance and naming throughout the code. Why? There are several reasons, among them:

- Different editors may display code that mixes tabs and spaces differently: a tab is equivalent to 8 spaces in some editors, and to 4 or 2 in others. If you mix spaces and tabs, a line indented with 1 tab may line up in your editor with a line indented with 4 spaces, while most other editors will show the same code misaligned or seemingly unindented. Badly indented code is difficult to read. Code which is difficult to read is difficult to understand.

- In some languages (Python...) incorrectly indented code cannot even be evaluated

- Using different naming schemes (CamelCase, underscore_separated, dash-separated) will

  1. Make the code look inconsistent, which again makes it difficult to read
  2. Make it difficult for you to remember class, instance variable and method names, and force you to go back and forth through the code to find which one was used
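
The first two points can be checked with a few lines of Python. This is a minimal sketch, assuming Python 3 (which rejects ambiguous mixes of tabs and spaces); the helper names indent_width and compiles are invented for the illustration.

```python
TAB_INDENTED = "\tx = 1"
SPACE_INDENTED = "    y = 2"


def indent_width(line, tab_size):
    # Expand tabs the way an editor with the given tab size would,
    # then count the leading spaces.
    expanded = line.expandtabs(tab_size)
    return len(expanded) - len(expanded.lstrip(" "))


# With a tab size of 4 both lines start at the same column ...
same_with_tab_size_4 = (
    indent_width(TAB_INDENTED, 4) == indent_width(SPACE_INDENTED, 4))
# ... with a tab size of 8 they do not.
same_with_tab_size_8 = (
    indent_width(TAB_INDENTED, 8) == indent_width(SPACE_INDENTED, 8))


def compiles(source):
    # Report whether Python accepts the indentation of 'source'.
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:
        return False


# One line indented with a tab, the next with eight spaces.
MIXED_BLOCK = "if True:\n\tx = 1\n        y = 2\n"
CONSISTENT_BLOCK = "if True:\n    x = 1\n    y = 2\n"
```

Here compiles(CONSISTENT_BLOCK) succeeds while compiles(MIXED_BLOCK) fails, even though the mixed block may look perfectly aligned in an editor that shows tabs as 8 columns.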

Next to formatting, naming schemes are very important as well. In some cases it can be useful to prefix certain class names to highlight a certain use. In all cases it is important to use clear names that say what your named entities (classes, methods, ...) do and mean. So use comprehensible names: for example, do not use 'gmfc' but 'get_method_from_class'. This does not slow down the system; it just tells developers what the code means. Understanding is the key to evolution and success.

One exception can be made for well-known abbreviations such as sp (stack pointer), pc (program counter), ... but I can imagine that you might not even have guessed what they meant without me telling you. Code like self.pm[self.sp] = self.pm[self.pc] (which would basically poke the current expression in the program onto the stack) can then look very confusing.
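
For comparison, here is how the same operation might read with descriptive names. This is only a sketch: the Frame class, its attributes, and the split of the single 'pm' array into a separate program and stack are assumptions made for readability, as is advancing both counters after the push.

```python
class Frame(object):
    """Hypothetical execution frame with spelled-out names."""

    def __init__(self, program, stack_size):
        self.program = program
        self.stack = [None] * stack_size
        self.program_counter = 0
        self.stack_pointer = 0

    def push_current_expression(self):
        # Same idea as self.pm[self.sp] = self.pm[self.pc].
        current_expression = self.program[self.program_counter]
        self.stack[self.stack_pointer] = current_expression
        # Assumption: move on to the next stack slot and expression.
        self.stack_pointer += 1
        self.program_counter += 1


frame = Frame(program=["first", "second"], stack_size=4)
frame.push_current_expression()
```

The method is a few lines longer, but a reader no longer has to know what pm, sp and pc stand for to follow it.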

Remember the following rule:

Code is written by programmers for programmers. Binaries are generated by compilers for computers.

If we did not care about all this, we might as well write in binary. But since we are programmers ourselves, and since we want to understand our own code as well, we prefer higher-level languages (ok, except for some geeks who love to feel special).

We will use the following scheme which is consistent with PyPy:

  • CamelCase classnames
  • underscore_separated methodnames
  • 4-space indents (no tabs)
  • Class names which model objects of our own language, and which will be first-class in the language, start with W_ (Wrapped)
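
Applied to code, the scheme could look like the following sketch. W_Object, W_String and Interpreter are hypothetical names chosen for the illustration; they are not classes prescribed by PyPy.

```python
class W_Object(object):
    """Base class for objects that are first-class in our language."""

    def get_type_name(self):
        return "object"


class W_String(W_Object):
    """A wrapped string, visible to programs written in our language."""

    def __init__(self, value):
        self.value = value

    def get_type_name(self):
        return "string"

    def get_length(self):
        return len(self.value)


class Interpreter(object):
    """VM-internal helper: not first-class, so no W_ prefix."""

    def wrap_string(self, value):
        return W_String(value)


interpreter = Interpreter()
w_greeting = interpreter.wrap_string("hello")
```

The W_ prefix makes it visible at a glance which values belong to the language being implemented and which belong to the machinery of the virtual machine.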

Tests

Once you start writing, every time you want to add a new concept you should first provide a test which documents what the concept should do. This test will of course fail, since the concept is not implemented yet. Then you implement the concept, which should make your test pass. Try to construct your tests so that they cover borderline cases. Save the tests in separate files, organized by the same concepts along which you split the code of your virtual machine (such as primitives, model, ...). Every time you add or alter a concept, you run the tests. If tests fail, you fix the bug in the code or, if necessary, alter the test to reflect the new reality. Do not just ignore a test unless it has become obsolete for the new state of the code.

Secondly, if at any point you find a bug in your code (most likely a bug concerning derived semantics), for which there was no failing test yet, add a new failing test highlighting the bug before fixing it.
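
As an illustration of this workflow, here is a minimal sketch for one hypothetical primitive. The names primitive_divide and PrimitiveFailed are invented for the example, and the tests are plain functions with assert statements, which a runner such as pytest can collect. In a real project the tests would live in their own file, separate from the primitive.

```python
class PrimitiveFailed(Exception):
    """Raised when a primitive cannot handle its arguments."""


def primitive_divide(dividend, divisor):
    # Written after the tests below; the zero check exists because
    # the borderline test demanded it.
    if divisor == 0:
        raise PrimitiveFailed("division by zero")
    return dividend // divisor


def test_divide_whole_numbers():
    assert primitive_divide(6, 3) == 2


def test_divide_by_zero_fails():
    # Borderline case: the primitive must fail in a controlled way.
    try:
        primitive_divide(1, 0)
    except PrimitiveFailed:
        return
    assert False, "expected PrimitiveFailed"


def test_divide_negative_rounds_down():
    # The kind of test you add when a bug is found: first write it,
    # watch it fail, then fix the code.
    assert primitive_divide(-7, 2) == -4
```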

All this testing and maintaining of tests might seem very verbose and time-consuming. However, I can assure you that once you are far into the development cycle, you will be very happy to have them around. You will alter something because you think it might be better, and then something else, and then something else. Once you get around to running your interpreter on some code, you notice that it breaks and you have to find the bug. The bug, however, might come from some object which you did not expect to be around at that point, and you no longer have a clue where anything comes from. If you have tests, you can simply run all of them after altering any piece of code, and if they cover enough ground, they will tell you exactly which object was wrongly returned (or not returned) and where, or whatever else you might have done wrong. This significantly reduces the time you spend scratching your head, and banging that same head against the wall, because you cannot find where something comes from.

Deciding on the object model

Every object-oriented programming language centers around a specific object model. This model can range from very compact (as few as two classes: one representing arrays and one representing integers) to a whole range of purpose-specific classes.
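
The compact end of that range can be sketched as follows. This is only an illustration under our own assumptions: the class names follow the W_ convention from above, and the methods add, at and at_put are hypothetical.

```python
class W_Integer(object):
    def __init__(self, value):
        self.value = value

    def add(self, w_other):
        return W_Integer(self.value + w_other.value)


class W_Array(object):
    def __init__(self, size):
        # Every slot holds a wrapped object, never a raw Python value.
        self.items = [W_Integer(0) for _ in range(size)]

    def at(self, index):
        return self.items[index]

    def at_put(self, index, w_value):
        self.items[index] = w_value


# A "point" built from nothing but the two classes.
w_point = W_Array(2)
w_point.at_put(0, W_Integer(3))
w_point.at_put(1, W_Integer(4))
w_sum = w_point.at(0).add(w_point.at(1))
```

With only these two classes, every composite value has to be encoded as an array of integers or of other arrays, which keeps the virtual machine small at the cost of less specialized objects.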


More to come!