Skim sections 1.1 – 1.3. This is pretty fluffy and we’ll mostly either skip it or talk about it in much more depth later, but it might be nice to see now for a little context.
Read sections 2.1 – 2.3 more carefully. We will talk about this over the next day or two.
Goodfellow et al.:
Read the intro to section 5.5 (but don’t worry about KL divergence), read section 5.5.1, lightly skim section 6.1, read the first 3 paragraphs of section 6.2.1.1, and skim sections 6.2.2.1, 6.2.2.2, and 6.2.2.3. We will talk about this over the next day or two.
Videos
I moved the videos that were here to later days.
Homework 1
Written part due 5pm Wed, Jan 29
Coding part due 5pm Fri, Jan 31
Fri, Jan 24
In class, we will work on:
Maximum likelihood and output activations for binary classification. I don’t have any lecture notes for this.
Matrix formulation of calculations for logistic regression across multiple observations. I don’t have any lecture notes for this, but it’s written up in Lab 01. (A small NumPy sketch also appears just after this list.)
Lab 01: you do some calculations for logistic regression in NumPy. You should receive an email from GitHub about this.
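To make this concrete, here is a minimal NumPy sketch of the kind of calculation Lab 01 works through, with my own toy numbers and variable names (the lab’s actual data and names will differ). Observations sit in the rows of X, so one matrix product computes the linear part for every observation at once:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 observations in the rows of X, 3 features each.
X = np.array([[ 0.5,  1.2, -0.3],
              [ 1.0, -0.7,  0.8],
              [-1.5,  0.4,  0.1],
              [ 0.3,  0.9, -1.1]])
y = np.array([1, 0, 0, 1])         # binary labels

w = np.array([0.2, -0.1, 0.4])     # one weight per feature
b = 0.05                           # intercept

# One matrix product gives z for all observations; the sigmoid turns it into
# predicted probabilities a_i = P(y_i = 1 | x_i).
z = X @ w + b                      # shape (4,)
a = sigmoid(z)

# Binary cross-entropy loss, i.e. the negative mean log-likelihood.
loss = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
print(a, loss)
```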
After class, please:
Reading
Continue/finish readings listed for Wed, Jan 22.
Take a look at the NumPy document listed above. You can’t run it directly on GitHub, but if you want you could sign into colab.research.google.com and try out some of the code there. Also cross-reference this with the NumPy section in Chollet.
Videos: Here are some videos of Andrew Ng talking about logistic regression and stuff we did today; you don’t need to watch these, but feel free if you want a review:
Loss function just thrown out there without justification: youtube
Discussing set up for loss function via maximum likelihood: youtube
Start thinking about “vectorization”, i.e. writing things in terms of matrix operations: youtube
More on vectorization, but I think this video is more complicated than necessary and you might skip it: youtube
Vectorizing logistic regression. Note that Andrew does this in the variant where your observations are in the columns of the X matrix. I want us to understand that you can also take the transpose and get an equally valid way of doing the computations, just oriented sideways. This is worth understanding because different sources and different software packages organize things in different ways, and you want that mental flexibility. We talked about this in class, but I don’t know of a video that explains things with the other orientation; there is a short NumPy sketch of both orientations just below. youtube
Some of you may find this sequence of three videos introducing Keras helpful (note that I will follow the code in Chollet, which differs slightly from code used in these videos):
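To illustrate the point about orientation in the vectorization videos above, here is a small NumPy sketch with made-up data: storing observations in the rows of X (compute \(\sigma(Xw + b)\)) and storing them in the columns (compute \(\sigma(w^\top X + b)\)) give exactly the same predictions, just organized sideways.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X_rows = rng.normal(size=(5, 3))    # 5 observations in rows, 3 features
w = rng.normal(size=(3,))
b = 0.1

# Orientation 1: observations in the rows of X.
a_rows = sigmoid(X_rows @ w + b)

# Orientation 2: observations in the columns of X (Andrew's convention).
X_cols = X_rows.T                   # same data, stored sideways
a_cols = sigmoid(w @ X_cols + b)

print(np.allclose(a_rows, a_cols))  # True: same predictions, different bookkeeping
```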
Hidden layers and forward propagation. Lecture notes: pdf. Note there is an error at the bottom of page 1: for multinomial regression, you use a softmax activation, not a sigmoid activation. (A small sketch contrasting the two appears just below.)
Lab 02
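To make the erratum about the activation function concrete, here is a small sketch with my own toy numbers (not from the lecture notes or Lab 02) contrasting the sigmoid used for binary classification with the softmax used for multinomial regression:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick; it doesn't
    # change the result because softmax is invariant to shifting all scores.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z_binary = 0.7                        # a single score: sigmoid gives P(y = 1)
print(sigmoid(z_binary))

z_multi = np.array([2.0, 0.5, -1.0])  # one score per class
p = softmax(z_multi)
print(p, p.sum())                     # entries are positive and sum to 1
```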
After class, please:
Wed, Feb 05
In class, we will work on:
Gradient descent; start on backpropagation for logistic regression. Lecture notes: pdf. Note there is an error on page 6 that we did not make in class: \(\frac{dJ}{dz} = a - y\), not \(y - a\). (A quick numerical check of this appears just below.)
Andrew Ng discusses Derivatives in Computation Graphs. This video feels unnecessarily complicated to me, but you might find it helpful to see things worked out with actual numbers.
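If you want to convince yourself of the corrected formula: with \(J = -\left(y \log a + (1 - y) \log(1 - a)\right)\) and \(a = \sigma(z)\), the chain rule with \(\frac{da}{dz} = a(1 - a)\) gives \(\frac{dJ}{dz} = a - y\). Here is a quick numerical check in NumPy (my own sketch, not part of the lecture notes), comparing the formula to a central-difference approximation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    # Binary cross-entropy for a single observation, written as a function of z.
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.8, 1.0
a = sigmoid(z)

analytic = a - y                                              # dJ/dz from the formula
eps = 1e-6
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)   # central difference

print(analytic, numeric)   # the two agree to several decimal places
```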
First high-level overview of word embeddings: a reminder of transfer learning, and a statement of the limitations of one-hot encodings that are addressed by word embeddings. The goal for today is to get the general ideas. We’ll develop a more detailed understanding over the next few days.
The discussion of transfer learning I mentioned was on Friday Feb 28.
In the video, I said we’d have a lab applying word embeddings to a classification problem, but I decided to make this lab shorter and more focused on just working with the actual word embedding vectors. We’ll do the application to classification next.
After class, please:
Work on Lab 9. Solutions are posted on the labs page.
One more thing I wanted to say but forgot to say in the video: Although we think of the embedding matrix in terms of a dot product with a one-hot encoding vector, in practice one-hot encodings are very memory intensive so we don’t actually do that when we set it up in Keras. Instead, we use a sparse encoding. For example, if “movie” is word 17 in our vocabulary, then we just represent that word with the number 17 rather than a vector of length 10000 with a 1 in the 17th entry. This is much faster.
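Here is a minimal Keras sketch of what I mean; the vocabulary size, embedding dimension, and word indices are just illustrative numbers, and the code in Chollet and in our labs will look a little different:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 10000       # number of words in the vocabulary (illustrative)
embedding_dim = 8        # length of each embedding vector (illustrative)

# The Embedding layer expects integer word indices, not one-hot vectors.
model = keras.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
])

# A "sentence" of three words, each represented by its index in the vocabulary.
# If "movie" is word 17, it shows up here simply as the integer 17.
word_indices = np.array([[17, 42, 256]])

embeddings = model.predict(word_indices)
print(embeddings.shape)  # (1, 3, 8): one embedding vector per word
```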
Lab 10.
After class, please:
Fri, Apr 03
In class, we will work on:
Review of topics from linear algebra: vector addition, subtraction, and orthogonal projections. (A short NumPy sketch of projections appears below.)
You only need to watch the first few minutes of this video; the rest is optional.
Errors: At minute 25, the standard deviation is \(\gamma^{[l-1]}\), not \(\sqrt{\gamma^{[l-1]}}\).
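Here is the short NumPy sketch of orthogonal projection promised above; the vectors are made up:

```python
import numpy as np

u = np.array([3.0, 1.0])
v = np.array([2.0, 2.0])

# Orthogonal projection of u onto v: proj_v(u) = (u . v / v . v) * v
proj = (u @ v) / (v @ v) * v

residual = u - proj
print(proj)          # the component of u that lies along v
print(residual @ v)  # ~0: what's left over is orthogonal to v
```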
I’d like you to watch a few videos by Andrew Ng introducing methods for object localization. I think these videos are well done, and I wouldn’t have much to add or do differently, so it doesn’t make sense for me to reproduce them. I’ll try to sum up important points from these videos for Monday.
Start on object localization. My only criticism of this video is that around minute 9, Andrew defines a loss function based on mean squared error. This is not really the right way to set it up; he briefly mentions a better way later in the video, and I will expand on that on Monday.
Landmark detection:
Sliding windows:
Convolutional implementation of sliding windows. I will expand on this in a separate video for Monday and in a lab, so don’t worry if this is confusing for now. (It’s still worth watching the video to get what you can out of it.)
More on object detection with YOLO. I recommend you watch both the video I made and the videos from Andrew Ng linked to below. There is a lot of common material, but there are also things I covered that Andrew didn’t, and things he covered that I didn’t. I think it’s always nice to have two perspectives.
Notes: Although I have gone into a little more detail about the setup of the model architecture for YOLO than Andrew does, my presentation is still a simplification of the actual YOLO model in several ways. If you’re curious, the papers are at https://arxiv.org/pdf/1506.02640.pdf and https://arxiv.org/pdf/1612.08242.pdf
After class, please:
Wed, Apr 15
In class, we will work on:
Last class I made one mistake and several intentional simplifications when discussing YOLO. Today, I want to fix the mistake and sketch a little more detail where I simplified things before.