Today I’ll start writing my very first assembler and hopefully finish the first half of the NAND2Tetris course! I read chapter 6 and thoroughly reviewed the specifications for the week 6 project yesterday, so now I’m ready to get my hands dirty with translating assembly into machine code! Woohoo!
See all the code in my GitHub repository here. (Don’t look at it until after you’ve already completed the project yourself, though!)
Version 1 assembler without symbols
Following the suggestion in the book, I’m first building a version of the assembler that does not handle symbols at all. But where do I start? Ugh, my brain isn’t working today… I already had some caffeine, but it isn’t helping, and I can’t have more caffeine yet because it will probably upset my stomach!
OK, time to get my pen and paper. That usually helps get my brain unstuck.
Stuff I need to do in my assembler:
- Ignore comments and whitespace
- Classify command types for A-instructions and C-instructions (then handle user-defined labels in version 2)
- Extract numbers from A-instructions
- Convert numbers from decimal to binary
- Extract fields from C-instructions
- Convert fields to binary codes
- Assemble binary codes into machine code instructions, one per line
OK, I’m going to tackle each of these functions one at a time.
First, there’s the trim() method, which conveniently removes whitespace from both ends of a string. But I could write my assembly code like D = D+1; JEQ and it should still work, so I want to remove all whitespace.
To do that, I need regular expressions! The special character \s matches any whitespace character, including tabs, newlines, and the like. I’ll use the replace() method with the global flag to remove every run of one or more whitespace characters in my string. (To remove a piece of a string, I’m just replacing it with an empty string.)
So here’s my function:
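A minimal sketch of what such a function could look like. The name removeWhitespace is an assumption for illustration and may not match the code in the repository:

```javascript
// Hypothetical helper name; only the technique (replace() with /\s+/g) comes from the post.
function removeWhitespace(str) {
  // \s+ matches one or more whitespace characters (spaces, tabs, newlines);
  // the g flag applies the replacement to every match, not just the first.
  return str.replace(/\s+/g, "");
}

console.log(removeWhitespace("  D = D+1; JEQ \t")); // "D=D+1;JEQ"
```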
I tested it out in my browser’s console and it works like a charm! This function will definitely come in handy for many future projects. So, note to self: be sure to talk about whitespace and regular expressions in any introductory programming classes I teach!
In the Hack assembly language, comments begin with // and can appear on their own line or at the end of a line of assembly code. My assembler will have to remove the comments, because comments are only meant for human eyes and have no impact on how the code gets translated into machine language.
I can use the replace() method again with a regular expression to match everything from the // to the end of the string, and just replace it with an empty string to remove it:
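As a rough sketch (the function name removeComments is hypothetical, and this assumes it is called on one line of assembly at a time):

```javascript
// Hypothetical helper name; assumes the input is a single line of assembly.
function removeComments(line) {
  // \/\/ matches a literal "//", .* matches the rest of the line,
  // and $ anchors the match to the end of the string.
  return line.replace(/\/\/.*$/, "");
}

console.log(removeComments("D=M//copy RAM value")); // "D=M"
console.log(removeComments("// a comment on its own line")); // ""
```

One thing to watch with this pattern: without the m flag, $ only matches at the very end of the input and . does not cross newlines, so on a multi-line string it would only strip a comment on the last line. Applying it line by line avoids that.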
Regular expressions always look ridiculous, haha. I’m using backslashes to escape my forward slashes (\/\/), then .* matches zero or more of any character and the $ anchors it to the end of the string, so this regular expression matches anything beginning with // and going until the end of the string.
OK, that wasn’t so hard after all! I’ve done regular expressions before, so it’s good to know that I haven’t forgotten them completely.
Classifying Hack assembly commands
As specified in the book, the Hack computer’s assembly language has A-instructions like @1337 and C-instructions like M=D-1. I need to identify which is which, because that will determine how I parse the instruction code.
Since I’m not worrying about labels and I’ve already stripped out all the whitespace and comments, I think I can get away with classifying the instructions based solely on whether the first character in the string is @ or not:
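A sketch of that check, assuming the instruction has already been stripped of whitespace and comments. The function name and the "A"/"C" return values are made up for illustration:

```javascript
// Hypothetical name and return values; labels are deliberately ignored in version 1.
function commandType(instruction) {
  // An A-instruction always starts with "@"; everything else is treated as a C-instruction.
  return instruction.charAt(0) === "@" ? "A" : "C";
}

console.log(commandType("@1337")); // "A"
console.log(commandType("M=D-1")); // "C"
```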
Extracting mnemonics from commands
Next, I have four functions for extracting the assembly code mnemonic out of an instruction string: one for the A-instructions, and one for each of the three fields of a C-instruction.
The first one seems super easy; I can just use the slice() method to chop off the first character:
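Something along these lines, with a hypothetical function name:

```javascript
// Hypothetical helper name; assumes the instruction starts with "@".
function extractAddress(instruction) {
  // slice(1) returns everything after the first character.
  return instruction.slice(1);
}

console.log(extractAddress("@1337")); // "1337"
```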
Extracting fields from a C-instruction won’t be quite so easy, but it shouldn’t be difficult either.
I forgot exactly what the C-instructions can look like, so I dug through the book again and found some clarification:
- The default form of a C-instruction is dest=comp;jump, where either the dest or jump fields can be empty
- If the dest field is empty, the = is omitted
- If the jump field is empty, the ; is omitted
D=D+1;JEQ is valid, but so is D=D+1. So my parser needs to look for optional equal signs and optional semicolons. Hmm…
Maybe I’m missing something, but this seems very simple: I’m always going to have two or three fields, and they will always be separated by either an equal sign, a semicolon, or one of each. If that’s true, I can just use the split() method a couple times to filter everything out and identify them this way:
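Here is one way the two-split idea could be sketched. The function name and the shape of the returned object are assumptions, not necessarily what the repository uses:

```javascript
// Hypothetical helper; assumes whitespace and comments were already removed.
function parseCInstruction(instruction) {
  // First split on ";": anything after it is the jump mnemonic (empty if absent).
  const [beforeJump, jump = ""] = instruction.split(";");
  // Then split on "=": if there is one, the left side is dest and the right side is comp.
  const parts = beforeJump.split("=");
  const dest = parts.length === 2 ? parts[0] : "";
  const comp = parts.length === 2 ? parts[1] : parts[0];
  return { dest, comp, jump };
}

console.log(parseCInstruction("D=D+1;JEQ")); // { dest: "D", comp: "D+1", jump: "JEQ" }
console.log(parseCInstruction("D=D+1"));     // { dest: "D", comp: "D+1", jump: "" }
console.log(parseCInstruction("0;JMP"));     // { dest: "", comp: "0", jump: "JMP" }
```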
This doesn’t seem like the most efficient way to do it, but it makes intuitive sense to me: I split it once to identify one possibility, then split it again to identify the other two possibilities.
Now I should be able to translate the fields into machine code with my other not-yet-written functions like this:
Next, I need to implement those little helper functions to take the Hack assembly mnemonics for destinations, computations and jumps and translate them into machine code.
Translating assembly mnemonics into machine code
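A sketch of the lookup-table approach, using the binary codes from the Hack machine language specification. The table and function names are hypothetical; the empty string stands in for an empty dest or jump field:

```javascript
// dest field: 3 bits (A, D, M)
const DEST = { "": "000", M: "001", D: "010", MD: "011", A: "100", AM: "101", AD: "110", AMD: "111" };

// jump field: 3 bits
const JUMP = { "": "000", JGT: "001", JEQ: "010", JGE: "011", JLT: "100", JNE: "101", JLE: "110", JMP: "111" };

// comp field: 7 bits (the a bit followed by c1..c6)
const COMP = {
  "0":   "0101010", "1":   "0111111", "-1":  "0111010",
  "D":   "0001100", "A":   "0110000", "M":   "1110000",
  "!D":  "0001101", "!A":  "0110001", "!M":  "1110001",
  "-D":  "0001111", "-A":  "0110011", "-M":  "1110011",
  "D+1": "0011111", "A+1": "0110111", "M+1": "1110111",
  "D-1": "0001110", "A-1": "0110010", "M-1": "1110010",
  "D+A": "0000010", "D+M": "1000010",
  "D-A": "0010011", "D-M": "1010011",
  "A-D": "0000111", "M-D": "1000111",
  "D&A": "0000000", "D&M": "1000000",
  "D|A": "0010101", "D|M": "1010101",
};

// Each helper is a plain property lookup, so an unknown mnemonic yields undefined.
function translateDest(mnemonic) { return DEST[mnemonic]; }
function translateComp(mnemonic) { return COMP[mnemonic]; }
function translateJump(mnemonic) { return JUMP[mnemonic]; }

console.log(translateComp("D+1")); // "0011111"
console.log(translateJump("JEQ")); // "010"
```

Because a missing key returns undefined rather than throwing, looking a mnemonic up in the wrong table fails silently, which is worth keeping in mind when debugging.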
Phew, that was tedious! But very easy.
Next, I’m going to make a little helper function for translating the addresses extracted from A-instructions from decimal to binary. For this, I’ll use the parseInt() function and the toString() method to convert to binary and then just pad the string with leading zeros as needed:
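A minimal sketch of that conversion. The function name is hypothetical, and using padStart() for the padding is an assumption (it needs an ES2017-capable browser; a manual loop would work too). Padding to 16 digits gives the complete A-instruction, since its leading bit is always 0:

```javascript
// Hypothetical helper; assumes a non-negative decimal address that fits in 15 bits.
function toBinaryAddress(decimalString) {
  const value = parseInt(decimalString, 10); // parse as base 10
  const bits = value.toString(2);            // convert to a base-2 string
  return bits.padStart(16, "0");             // left-pad with zeros to 16 characters
}

console.log(toBinaryAddress("1337")); // "0000010100111001"
console.log(toBinaryAddress("0"));    // "0000000000000000"
```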
Hurray, it works! I forgot how to pad numbers at first, so that took me a couple tries.
Putting it all together!
Now I get to put all these functions to use! Hmm, but now that I think about it, the way I set up all these helper functions as methods of objects won’t play nice with my forEach loop. I really don’t need those extra objects, anyhow. I’ll just rename everything…
OK, everything is renamed. Just a bunch of functions now. Here’s the code for my assembler using all those helper functions:
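To show how the pieces fit together, here is a compact sketch of a symbol-free assembler. It is not the code from the repository: the names are hypothetical, the lookup tables are cut down to the few entries the demo needs, and it processes the source one line at a time.

```javascript
// Abbreviated tables (assumption: only the entries used in the demo below).
const COMP_CODES = { "0": "0101010", "D": "0001100", "D+1": "0011111", "M": "1110000" };
const DEST_CODES = { "": "000", M: "001", D: "010" };
const JUMP_CODES = { "": "000", JEQ: "010", JMP: "111" };

function assemble(source) {
  const output = [];
  source.split("\n").forEach(function (line) {
    // Strip the comment first, then every remaining whitespace character.
    const instruction = line.replace(/\/\/.*$/, "").replace(/\s+/g, "");
    if (instruction === "") return; // blank or comment-only line

    if (instruction.charAt(0) === "@") {
      // A-instruction: 16-bit binary form of the decimal address.
      output.push(parseInt(instruction.slice(1), 10).toString(2).padStart(16, "0"));
    } else {
      // C-instruction: split off the optional jump, then the optional dest.
      const [beforeJump, jump = ""] = instruction.split(";");
      const parts = beforeJump.split("=");
      const dest = parts.length === 2 ? parts[0] : "";
      const comp = parts.length === 2 ? parts[1] : parts[0];
      // Bit layout: 111, then comp (7 bits), dest (3 bits), jump (3 bits).
      output.push("111" + COMP_CODES[comp] + DEST_CODES[dest] + JUMP_CODES[jump]);
    }
  });
  return output.join("\n");
}

console.log(assemble("@2 // load 2\nD = D+1; JEQ\n0;JMP"));
// 0000000000000010
// 1110011111010010
// 1110101010000111
```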
Time to test it! I’m just going to run this code in my browser’s console by pasting the entire thing in there and typing in a string containing a couple of made-up assembly instructions.
Uh oh! It’s not working as expected. If I run D=D+1;JEQ through my assembler, the output is "1110011111010undefined\n". That isn’t right…
First, a quick fix for one issue: I should strip whitespace from the final string. OK, done.
Now for the weird part: why am I getting “undefined”?! Clearly, I need to throw in some console.log() statements to see what’s going on in there.
15 minutes later: Found it! I made a typo of sorts and was trying to match the jump mnemonic against the table of destination codes, so my helper function was returning undefined because it didn’t find a match in my hash table thingy. Fixed!
Now to test it with the provided test scripts, which include comments and whitespace…
Oh no! It isn’t happy with comments. Well, I need to stop for now and go to an event tonight. I’ll commit my changes and pick up where I left off later.
Learning summary (#TIL)
- Finished version 1 of my first assembler. It’s broken, but it’s a start!
- Finish building the Hack assembler to complete the project for week 6!
- Study time: 3 hours 34 min
- Working on project: 3 hours 34 min