What I Learned Today
Feb 2022: #
2022-02-09: #
- Now that we found the closed form solution for \( \hat{\beta} \) we can find its mean and variance. Assume \( y = \textbf{x}\beta + \epsilon \) where \( \epsilon \sim \mathcal{N}(0,\sigma^2 I) \). Therefore: \[ \hat{\beta} = (\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T(\textbf{x}\beta + \epsilon) \] \[ =(\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T\textbf{x}\beta + (\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T\epsilon \] \[ =\beta + (\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T\epsilon \] The second term here is a scaled Gaussian. Since \(\epsilon \) is distributed with a mean of 0: \[ E[(\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T\epsilon] = (\textbf{x}^T\textbf{x})^{-1}\textbf{x}^TE[\epsilon] = 0 \] \[ E[\hat{\beta}] = \beta \] Furthermore, by the linear-transformation rule for covariance matrices: \[ Var[\hat{\beta}] = Var[(\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T\epsilon] = [(\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T] Var[\epsilon] [(\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T]^T \] \[ =[(\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T] \sigma^2 I [(\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T]^T = \sigma^2(\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T \textbf{x}(\textbf{x}^T\textbf{x})^{-1} \] \[ = \sigma^2 (\textbf{x}^T\textbf{x})^{-1} \] Therefore \( \hat{\beta} \) is distributed as a multivariate normal: \[ \hat{\beta} \sim \mathcal{N}(\beta,\sigma^2 (\textbf{x}^T\textbf{x})^{-1} ) \] and: \[ Var[\hat{\beta}] = \sigma^2 (\textbf{x}^T\textbf{x})^{-1} = \begin{bmatrix} Var(\hat{\beta_0}) & Cov(\hat{\beta_0},\hat{\beta_1}) \\ Cov(\hat{\beta_0},\hat{\beta_1}) & Var(\hat{\beta_1}) \end{bmatrix} \] \[ \sigma^2 (\textbf{x}^T\textbf{x})^{-1} = \sigma^2\begin{bmatrix} N & \sum_{i=1}^N x_i \\ \sum_{i=1}^N x_i & \sum_{i=1}^N x_i^2 \end{bmatrix}^{-1} = \sigma^2 \left(N\begin{bmatrix} 1 & \bar{x} \\ \bar{x} & \bar{x^2} \end{bmatrix}\right)^{-1} \] \[ =\sigma^2\frac{1}{N}\frac{1}{\bar{x^2} - \bar{x}^2}\begin{bmatrix} \bar{x^2} & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix} \] Where \[ s^2_x = \frac{\sum (x_i - \bar{x})^2}{N} = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N} = \bar{x^2} - \bar{x}^2 \] is the (population-form) sample variance: \[ Var[\hat{\beta}]=\begin{bmatrix} \frac{\sigma^2\bar{x^2}}{N s^2_x} & \frac{-\sigma^2\bar{x}}{N s^2_x} \\ \frac{-\sigma^2\bar{x}}{N s^2_x} & \frac{\sigma^2}{N s^2_x} \end{bmatrix} \] Be careful to distinguish \( \bar{x^2} \) from \( \bar{x}^2 \): the former is the mean of the squared entries, the latter the square of the mean.
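A quick way to sanity-check this is by simulation. A minimal NumPy sketch (the design matrix, noise level, and true \( \beta \) below are made up for illustration), comparing the empirical covariance of \( \hat{\beta} \) over many noise draws against \( \sigma^2(\textbf{x}^T\textbf{x})^{-1} \):

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 200, 0.5
beta_true = np.array([1.0, 2.0])           # [intercept, slope]

x = rng.uniform(-1, 1, N)
X = np.column_stack([np.ones(N), x])       # design matrix with an intercept column

# Theoretical covariance: sigma^2 (X^T X)^{-1}
cov_theory = sigma**2 * np.linalg.inv(X.T @ X)

# Empirical covariance of beta_hat over many simulated noise draws
betas = []
for _ in range(5000):
    y = X @ beta_true + rng.normal(0, sigma, N)
    betas.append(np.linalg.solve(X.T @ X, X.T @ y))
betas = np.array(betas)

print(np.cov(betas.T))   # should be close to cov_theory
print(cov_theory)
```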
2022-02-08: #
- You can also find the variance of your estimated parameters. One example for a simple linear model with normal noise: let \( \textbf{x}\beta \) be the predicted points given weights \( \beta \). \[ \textbf{x}\beta = \begin{bmatrix} \beta_0 + \beta_1x_1 \\ \beta_0 + \beta_1x_2 \\ \vdots\\ \beta_0 + \beta_1x_n \end{bmatrix} \] We saw from the previous exercise that the maximum likelihood estimates \( \hat{\beta} \) were equivalent to those produced by minimizing mean squared error. \[ \mathcal{L} = log\mathcal{P}(D|\beta,\sigma^2) = \sum_{i=1}^N (log \frac{1}{\sqrt{2 \pi \sigma^2}} \exp{-\frac{(y_i - x_i\beta)^2}{2\sigma^2}}) = -\frac{N}{2}log2\pi\sigma^2 -\sum_{i=1}^N \frac{(y_i - x_i\beta)^2}{2\sigma^2} \] \[ \hat{\beta} = argmax_\beta -\sum_{i=1}^N \frac{(y_i - x_i\beta)^2}{2\sigma^2} = argmin_\beta \sum_{i=1}^N \frac{(y_i - x_i\beta)^2}{2\sigma^2} = argmin_\beta \frac{1}{N}\sum_{i=1}^N (y_i - x_i\beta)^2 \] We can rewrite mean squared error with \( e = (\textbf{y} - \textbf{x}\beta) \) in matrix form as: \[ MSE = \frac{1}{N}e^Te = \frac{1}{N}(\textbf{y} - \textbf{x}\beta)^T(\textbf{y} - \textbf{x}\beta) = \frac{1}{N}(\textbf{y}^T-\beta^T\textbf{x}^T)(\textbf{y}-\textbf{x}\beta) \] \[ = \frac{1}{N}(\textbf{y}^T\textbf{y} - \textbf{y}^T\textbf{x}\beta - \beta^T\textbf{x}^T\textbf{y} + \beta^T\textbf{x}^T\textbf{x}\beta) = \frac{1}{N}(\textbf{y}^T\textbf{y} - 2\beta^T\textbf{x}^T\textbf{y} + \beta^T\textbf{x}^T\textbf{x}\beta) \] We can minimize this value by taking the gradient w.r.t. \( \beta \): \[ \nabla MSE(\beta) = \frac{1}{N}(0 - 2\textbf{x}^T\textbf{y} + 2\textbf{x}^T\textbf{x}\beta) \] Setting this to 0, we can find the optimal solution: \[ \textbf{x}^T\textbf{x}\hat{\beta} - \textbf{x}^T\textbf{y} = 0 \] Isolating \(\hat{\beta} \) we find: \[ \hat{\beta} = (\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T\textbf{y} \] Now you can find the variance (2022-02-09). Such a cliffhanger, I know! :)
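In code the closed form is essentially one line. A minimal NumPy sketch on synthetic data (in practice `np.linalg.solve` or `lstsq` is preferred over forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 100)    # y = beta_0 + beta_1 x + noise

X = np.column_stack([np.ones_like(x), x])    # add the intercept column

# beta_hat = (X^T X)^{-1} X^T y, solved without an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # approximately [3, 2]
```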
2022-02-07: #
- Logistic Regression. Mean squared error is not used in logistic regression. In binary classification we can say that \( \mathcal{P}(y=0|x) + \mathcal{P}(y=1|x) = 1 \) where: \( \mathcal{P}(y=1|x,w) = \frac{1}{1+ e^{-w^Tx}} \) and \( \mathcal{P}(y=0|x,w) = 1-\mathcal{P}(y=1|x,w)= \frac{e^{-w^Tx}}{1+ e^{-w^Tx}} \) The goal is to maximize the log-likelihood: \[ \mathcal{L} = \sum^N_{i=1} y_i log(\frac{1}{1+ e^{-w^Tx_i}}) + (1-y_i)log(\frac{e^{-w^Tx_i}}{1+ e^{-w^Tx_i}}) \] \[ \frac{\partial \mathcal{L}}{\partial w} = \sum^N_{i=1} (y_i - \mathcal{P}(y=1|x_i,w))x_i \]
We can see that this log-likelihood is concave rather than convex; its negative, the negative log-likelihood, is always convex, so we instead minimize \( NLL = \sum^N_{i=1} -y_i log(\hat{y}_i) - (1-y_i)log(1-\hat{y}_i) \). Since MSE isn't convex for logistic regression, gradient descent cannot be used on it effectively: \( \nabla MSE = \frac{1}{N} \sum^N_{i=1} -2(y_i - \hat{y}_i)\hat{y}_i(1-\hat{y}_i)x_i \). Both can in theory be minimized to find the optimal solution \( y = \hat{y} \), but in practice this is difficult. Furthermore, MSE penalizes even correct classifications more than log-loss does.
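For concreteness, a minimal NumPy sketch of minimizing the average NLL with plain gradient descent on synthetic data (the learning rate and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept + one feature
w_true = np.array([-0.5, 2.0])
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(2)
lr = 0.5
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))        # P(y = 1 | x, w)
    grad_nll = -X.T @ (y - p) / N       # gradient of the average NLL
    w -= lr * grad_nll

print(w)   # roughly recovers w_true
```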
2022-02-06: #
- The estimators produced using likelihood functions exactly match the ordinary least squares estimators. This is a special property of assuming independent Gaussian noise, and it appears when maximizing the likelihood function and minimizing the sum of squares. This can be seen where \( y = h_w(x) + \epsilon \) with \( \epsilon \sim \mathcal{N}(0,\sigma^2) \). \[ P(y|x,\omega, \sigma^2) = \mathcal{N}(y;\, \omega^T x,\, \sigma^{2}) \] \[ log\mathcal{P}(D|\omega,\sigma^2) = \sum_{i=1}^N log\mathcal{N}(y_i;\, \omega^T x_i,\, \sigma^{2}) \] \[ log\mathcal{P}(D|\omega,\sigma^2) = \sum_{i=1}^N (log \frac{1}{\sqrt{2 \pi \sigma^2}} \exp{-\frac{(y_i - x_i\omega)^2}{2\sigma^2}}) = -\frac{N}{2}log2\pi\sigma^2 -\sum_{i=1}^N \frac{(y_i - x_i\omega)^2}{2\sigma^2} \] \[ \omega_{ML} = argmax_\omega -\sum_{i=1}^N \frac{(y_i - x_i\omega)^2}{2\sigma^2} = argmin_\omega \sum_{i=1}^N \frac{(y_i - x_i\omega)^2}{2\sigma^2} = argmin_\omega \frac{1}{N}\sum_{i=1}^N (y_i - x_i\omega)^2 \]
Thus \( \omega_{ML} = (X^TX)^{-1}X^Ty \) maximizes likelihood and minimizes mean squared error.
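A small numerical check of this equivalence, assuming SciPy is available (synthetic data, known \( \sigma^2 \)): minimizing the Gaussian negative log-likelihood numerically lands on the same \( \omega \) as the closed-form least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, -2.0]) + rng.normal(0, 0.3, 50)
sigma2 = 0.3**2

def neg_log_lik(w):
    r = y - X @ w
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + np.sum(r**2) / (2 * sigma2)

w_ml = minimize(neg_log_lik, x0=np.zeros(2)).x          # numerical MLE
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)           # closed-form least squares
print(w_ml, w_ols)   # the two estimates agree to numerical precision
```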
Jan 2022: #
2022-01-25: #
- Fun fact: the number of seconds in a year is within half a percent of π×10^7: 365.25 days × 86,400 s/day ≈ 3.156×10^7 s, while π×10^7 ≈ 3.142×10^7.
2022-01-24: #
- To address the shortcomings of a fixed suggestion algorithm, Fast R-CNN first feeds the image through a CNN to generate a feature map; the regions produced by selective search are then projected onto this map. By constructing the feature map first, only one convolution cycle occurs, reducing computation time. Again the regions are warped into squares and converted by a pooling layer to a fixed size for classification using a fully connected layer.
- The slow selective-search process creates a performance bottleneck. Faster R-CNN eliminates it by using an additional network, the region proposal network, to suggest regions once the image has passed through the first convolution cycle. The later classification stages are similar, but are performed less frequently and with higher success thanks to better region suggestions. Proposals are generated from a set of predefined anchor boxes at each position; the boxes that best overlap an object (the largest Intersection over Union) are selected.
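As a rough illustration of using such a detector (not the original authors' code), torchvision ships a pretrained Faster R-CNN. A minimal inference sketch, assuming a recent torchvision with downloadable weights:

```python
import torch
import torchvision

# Pretrained Faster R-CNN with a ResNet-50 FPN backbone (the weights argument
# name varies slightly between torchvision versions)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)          # stand-in for an RGB image scaled to [0, 1]
with torch.no_grad():
    predictions = model([image])         # the model expects a list of image tensors

# Each prediction contains proposed boxes, class labels, and confidence scores
print(predictions[0]["boxes"].shape, predictions[0]["scores"][:5])
```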
2022-01-23: #
- Region-based convolutional networks (R-CNN) suggest candidate bounding boxes and classify them with a CNN to detect the presence of objects; the later Faster R-CNN variant generates these proposals with a region proposal network (RPN). To avoid analyzing an extraordinarily large number of regions, Ross Girshick et al. suggest roughly 2000 regions produced from a selective search in which similar regions are merged into larger ones. These regions are then warped into a square and fed into a CNN to extract image features, which are used to classify the presence of an object with Support Vector Machines.
2022-01-22: #
- Encoder-Decoder models use a similar architecture to FCNs. Images are first encoded into a feature vector using a series of convolutional and pooling layers, followed by a decoder network composed of unpooling and deconvolution layers. The decoder produces a probabilistic pixel-wise class mask. Some architectures such as SegNet perform nonlinear upsampling by reusing the pooling indices recorded by the corresponding encoder layers. This reduces trainable parameters and produces an upsampled map which can be convolved with a final trainable classification filter.
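The index-sharing trick is easy to see in isolation. A minimal PyTorch sketch (not SegNet itself) where the decoder's unpooling reuses the indices recorded by the encoder's max pooling:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)                      # a toy feature map

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

pooled, indices = pool(x)        # encoder: downsample, remembering where the maxima were
upsampled = unpool(pooled, indices)   # decoder: place values back at the recorded positions

# A trainable convolution can then refine the sparse upsampled map
refine = nn.Conv2d(8, 8, kernel_size=3, padding=1)
print(refine(upsampled).shape)                     # torch.Size([1, 8, 32, 32])
```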
2022-01-21: #
- In conventional classification, an image is downsized using convolutional layers before being processed by a fully connected network. In contrast, a fully convolutional network (FCN) uses only convolutional layers. This flexibility allows the network to scale to a variety of differently sized images. Furthermore, once the image has been downscaled using convolutional and pooling layers, a pixel-wise mask for image segmentation can be produced if the output is upsampled.
Deep convolutional layers can extract additional features; however, spatial information is lost as layers become deeper. To counteract this, deep layers can be fused with shallower layers, which retain more spatial location information, for enhanced performance.
Coarser, deeper layers can be upsampled before being combined with earlier layers; the result is a much more accurate segmentation.
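A minimal sketch of that fusion step (channel counts and shapes are made up; this is not the exact FCN architecture): the coarse score map is upsampled with a transposed convolution and added to a 1×1-convolved shallower map.

```python
import torch
import torch.nn as nn

num_classes = 21
deep = torch.randn(1, 512, 16, 16)      # coarse, deep feature map
shallow = torch.randn(1, 256, 32, 32)   # finer feature map from an earlier layer

score_deep = nn.Conv2d(512, num_classes, 1)       # per-class scores from the deep map
score_shallow = nn.Conv2d(256, num_classes, 1)    # per-class scores from the shallow map
upsample = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4, stride=2, padding=1)

fused = upsample(score_deep(deep)) + score_shallow(shallow)   # 2x upsample, then fuse
print(fused.shape)                                            # torch.Size([1, 21, 32, 32])
```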
2022-01-20: #
- K-Means Clustering is an unsupervised clustering algorithm built on the idea of forming tight groups of feature vectors given a set number of clusters; a rough code sketch follows the list below.
1. A number of clusters is chosen, each of which is initialized with a mean \( \mu^{0}_1,\mu^{0}_2,…,\mu^{0}_n \)
2. Feature vectors are arbitrarily assigned to a cluster.
3. The initial means for each cluster are calculated, in addition to the Euclidean distance between each feature vector and the centers.
4. Feature vectors are reassigned to the nearest cluster. \( Y_i = argmin_k Dist(\vec{x_i},\vec{\mu_k}) \)
5. Cluster centers are recomputed. \( \vec{\mu_k} = \frac{1}{N_k} \sum_{i:Y_i=k} \vec{x_i} \)
6. Repeat stages 3 through 5 until the clusters are stable and do not change further. The process can also be capped at a set number of iterations.
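And the promised sketch: a minimal NumPy implementation of the loop above, assuming Euclidean distance and random initial assignments (toy two-cluster data for illustration).

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, len(X))                     # step 2: arbitrary initial assignment
    for _ in range(iters):
        # steps 3/5: recompute each cluster center as the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: reassign every point to the nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):              # step 6: stop once stable
            break
        labels = new_labels
    return centers, labels

data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0, 1, (50, 2)), data_rng.normal(5, 1, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```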
2022-01-19: #
- The Canny Edge Detector combines multiple steps to address some of the previously mentioned shortcomings. The algorithm begins by applying a Gaussian filter, smoothing the image and reducing noise. It then uses a first-order technique to calculate intensity gradients and identify possible edge candidates. Non-maximum suppression is then performed, removing pixels that may not be part of an edge: a pixel must be a local maximum of gradient magnitude along the gradient direction in order to be kept as an edge. Additional thresholding with hysteresis is applied to remove the weakest edges, which could be false positives due to noise in the image. This is done by applying two threshold values: an upper limit, above which a pixel is declared a strong edge, and a lower limit. If a pixel falls between the two, it is declared an edge only if it is connected to a "strong" edge.
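For reference, OpenCV bundles all of these steps. A minimal sketch assuming `opencv-python` is installed and an `input.png` exists; the hysteresis thresholds 100 and 200 are arbitrary.

```python
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)        # smooth to suppress noise first
edges = cv2.Canny(blurred, 100, 200)                # lower and upper hysteresis thresholds
cv2.imwrite("edges.png", edges)
```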
2022-01-18: #
- Edge Detection utilizes sharp discontinuities in gray-level intensities to define the boundary of the object. To identify edge segments, convolutions with weighted detector matrices are used. The Sobel Operator is a common approximation to the gradient of the image. \[ M_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \] \[ M_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \]
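Applying the Sobel kernels is just two 2-D convolutions followed by a magnitude. A minimal SciPy sketch (the image is a random placeholder):

```python
import numpy as np
from scipy.signal import convolve2d

Mx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
My = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]])

img = np.random.rand(64, 64)                 # stand-in for a grayscale image
gx = convolve2d(img, Mx, mode="same")        # horizontal gradient estimate
gy = convolve2d(img, My, mode="same")        # vertical gradient estimate
magnitude = np.hypot(gx, gy)                 # edge strength per pixel
print(magnitude.shape)
```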
- Laplacian of Gaussian (LoG) is a second-order-derivative, Gaussian-based approach. Here an edge occurs where the second derivative crosses zero. Like first-order methods, the Laplacian is also sensitive to noise; by applying a Gaussian filter first, noise can be reduced prior to the application of the Laplacian.
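SciPy offers the combined operator directly. A minimal sketch where sign changes of the filtered image approximate the zero crossings (random placeholder image, arbitrary \( \sigma \)):

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

img = np.random.rand(64, 64)                     # stand-in for a grayscale image
log = gaussian_laplace(img, sigma=2.0)           # Gaussian smoothing + Laplacian in one step

# crude zero-crossing detection along the horizontal direction
zero_cross = np.signbit(log[:, :-1]) != np.signbit(log[:, 1:])
print(zero_cross.sum(), "horizontal sign changes")
```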
2022-01-17: #
- A whole lot about threshold segmentation techniques. Otsu's Method is a pretty neat twist on Linear Discriminant Analysis / Fisher's Linear Discriminant.
- In the 2-class case, when the data is projected onto \( \vec{\omega} \), the (maximum likelihood) classification boundary is found where \( p(\vec{x}|\Sigma_0,\vec{\mu_0})= p(\vec{x}|\Sigma_1,\vec{\mu_1}) \). Alternatively, one can say an observation belongs to the second class if the likelihood ratio (or equivalently its log) exceeds some threshold value: \[ \text{Likelihood ratio}= {\frac {{\sqrt {2\pi |\Sigma _{1}|}}^{-1}\exp \left(-{\frac {1}{2}}(x-\mu _{1})^{T}\Sigma _{1}^{-1}(x-\mu _{1})\right)}{{\sqrt {2\pi |\Sigma _{0}|}}^{-1}\exp \left(-{\frac {1}{2}}(x-\mu _{0})^{T}\Sigma _{0}^{-1}(x-\mu _{0})\right)}}>T \] For multiple classes, you can first model each class-conditional distribution as a Gaussian. From there you find the prior class probabilities \( P(C_k) \). Using these, find the posterior class probabilities for a given feature vector \( P(C_k | \vec{x}) \) and choose the highest. Rinse and repeat. This can be generalized to higher dimensions, which makes for a surprisingly accurate technique for classification problems. Using it on the MNIST dataset for instance, projecting the dataset onto 2 dimensions you can still get around 56% accuracy, and close to 74% when increased to 3D.
The projection \( \vec{\omega} \) is chosen to maximize the ratio of between-class to within-class scatter: \[ S(\vec{\omega})= \frac{\vec{\omega}^T \Sigma_B \vec{\omega}}{\vec{\omega}^T \Sigma_W \vec{\omega}} \] which is maximized by the top \( D' \) eigenvectors of \( \Sigma^{-1}_W \Sigma_B \): \[ \vec{\omega} = max_{D'}(eig(\Sigma^{-1}_W \Sigma_B)) \]
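A minimal NumPy sketch of computing that projection for two toy Gaussian classes (the scatter matrices are built exactly as above; data and means are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
X0 = rng.normal([0, 0], 1.0, size=(100, 2))      # class 0
X1 = rng.normal([3, 1], 1.0, size=(100, 2))      # class 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)   # within-class scatter
diff = (m1 - m0).reshape(-1, 1)
Sb = diff @ diff.T                                        # between-class scatter

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = eigvecs[:, np.argmax(eigvals.real)].real             # direction maximizing S(w)
print(w / np.linalg.norm(w))
```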
Chances are I'll write some of this stuff up, so stay tuned.
2022-01-16: #
- The Johns Hopkins student population has very predictable visits to the library. My little investigation and tracker can be found here.
- Using `mmap` to map a fraction of a page will instead map the next largest whole number of pages.
- The v-node table is shared by all processes. How convenient!
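A tiny Python sketch of the page-granularity point (using a throwaway temporary file; exact behaviour is platform-dependent):

```python
import mmap
import tempfile

print("page size:", mmap.PAGESIZE)

with tempfile.TemporaryFile() as f:
    f.write(b"x" * 100)
    f.flush()
    # ask to map only 100 bytes; the kernel still reserves a whole page
    m = mmap.mmap(f.fileno(), 100)
    print(len(m), "bytes visible, but the mapping occupies one full page")
    m.close()
```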
2022-01-15: #
- Only 1% of global lithium is being mined and processed in the U.S. More than 80% of the world’s raw lithium is mined in Australia, Chile, and China. China accounts for more than half of the world’s lithium processing and refining. This should serve as a reminder to look into lithium stocks later.
- Domain name registration used to be free. Who knew.
- Using Singular Value Decomposition and Principal Component Analysis to perform a change of basis on a set of data is a remarkably simple yet effective way to reduce dimensionality.
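A minimal NumPy sketch of PCA via SVD on toy data, keeping the top two components:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))            # 200 samples, 5 features

Xc = X - X.mean(axis=0)                  # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
X_reduced = Xc @ Vt[:k].T                # project onto the top-k principal directions
explained = S[:k]**2 / np.sum(S**2)      # fraction of variance each component explains
print(X_reduced.shape, explained)
```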