The models described in this card detect one or more physical objects from three categories within an image and return a box around each object, as well as a category label for each object. The three categories of objects detected by these models are persons, bags and faces. These models are based on the NVIDIA DetectNet_v2 detector with ResNet34 as the feature extractor. This architecture, also known as GridBox object detection, uses bounding-box regression on a uniform grid on the input image. The GridBox system divides an input image into a grid in which each cell predicts four normalized bounding-box parameters (xc, yc, w, h) and a confidence value per output class. The raw normalized bounding-box and confidence detections need to be post-processed by a clustering algorithm, such as DBSCAN or NMS, to produce the final bounding-box coordinates and category labels.

This model was trained using the DetectNet_v2 entrypoint in TAO. The training algorithm optimizes the network to minimize the localization and confidence loss for the objects. The training is carried out in two phases. In the first phase, the network is trained with regularization to facilitate pruning. Following the first phase, we prune the network, removing channels whose kernel norms are below the pruning threshold. In the second phase, the pruned network is retrained. For the quantized INT8 model, a third quantization-aware training (QAT) phase is carried out. Regularization is not included in the second and third phases.

The PeopleNet v1.0 model was trained on a proprietary dataset with more than 17 million objects for the person class. The training dataset consists of a mix of camera heights, crowd densities, and fields-of-view (FOV). Approximately half of the training data consisted of images captured in an indoor office environment. This content was chosen to improve accuracy of the models for the convenience-store retail analytics use-case, in which the camera is typically set up at approximately 10 feet height and a 45-degree angle and has a close field-of-view.

Training Data Ground-truth Labeling Guidelines

The training dataset is created by labeling ground-truth bounding-boxes and categories by human labellers. The following guidelines were used while labelling the training data for the NVIDIA PeopleNet model. If you are looking to re-train with your own dataset, please follow the guidelines below for highest accuracy.

1. All objects that fall under one of the three classes (person, face, bag) in the image and are larger than the smallest bounding-box limit for the corresponding class (height >= 10px OR width >= 10px) are labeled with the appropriate class label.
2. Occlusion: partially occluded objects that do not belong to the person class and are visible approximately 60% or more are marked as visible objects, with a bounding box drawn around the visible part of the object. These objects are marked as partially occluded. Objects under 60% visibility are not annotated.
3. If a person is carrying an object, mark the bounding-box to include the carried object as long as it doesn't affect the silhouette of the person, i.e., include carried items that do not alter the silhouette of the pedestrian significantly. For example, exclude a rolling bag if the person is pulling it behind them and it is distinctly visible as a separate object.
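The GridBox decoding step can be sketched as follows. This is a minimal illustration that assumes cell-relative center offsets, image-normalized width/height, and a fixed stride; the exact DetectNet_v2 output encoding may differ, so treat the layout here as a hypothetical example rather than the real parser.

```python
import numpy as np

def decode_gridbox(preds, stride=16, img_w=960, img_h=544, conf_thresh=0.5):
    """Decode per-cell (xc, yc, w, h, conf) predictions for one class
    into absolute (x1, y1, x2, y2, conf) boxes.

    preds: array of shape (grid_h, grid_w, 5). The encoding assumed here
    (center offset relative to the cell, size normalized to the image)
    is for illustration only.
    """
    grid_h, grid_w, _ = preds.shape
    boxes = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            xc_off, yc_off, w, h, conf = preds[gy, gx]
            if conf < conf_thresh:
                continue
            # Cell-relative center -> absolute pixel coordinates.
            xc = (gx + xc_off) * stride
            yc = (gy + yc_off) * stride
            bw, bh = w * img_w, h * img_h
            boxes.append((xc - bw / 2, yc - bh / 2,
                          xc + bw / 2, yc + bh / 2, conf))
    return boxes
```

The output of a decoder like this is what the clustering stage (DBSCAN or NMS) consumes: many overlapping raw boxes per object, to be merged into one final detection each.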
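As a concrete example of the clustering step, a greedy NMS pass over decoded boxes might look like the sketch below. This is pure Python for illustration; the actual TAO post-processing (whether DBSCAN or NMS) is configurable and implemented differently.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2, ...) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    boxes: list of (x1, y1, x2, y2, conf). Keeps the highest-confidence
    box in each overlapping cluster and suppresses the rest.
    """
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept
```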
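The pruning criterion between the two training phases — removing channels whose kernel norms fall below a threshold — can be illustrated as follows. The (out_ch, in_ch, kh, kw) tensor layout and the choice of an L2 norm are assumptions for this sketch, not a description of TAO's internal pruner.

```python
import numpy as np

def surviving_channels(weights, threshold):
    """Return indices of output channels to keep after pruning.

    weights: conv kernel tensor of shape (out_ch, in_ch, kh, kw)
    (assumed layout). A channel survives if the L2 norm of its
    kernel is at or above the pruning threshold; channels below
    the threshold would be removed from the network.
    """
    norms = np.sqrt((weights ** 2).sum(axis=(1, 2, 3)))
    return np.flatnonzero(norms >= threshold)
```

After selecting the surviving channels, the pruned (smaller) network is retrained in the second phase to recover accuracy, as described above.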
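If you apply the labeling guidelines programmatically when preparing your own dataset, the minimum bounding-box size rule (height >= 10px OR width >= 10px) reduces to a simple check like this hypothetical helper:

```python
def meets_size_limit(box, min_side=10):
    """Check the smallest bounding-box limit from the labeling
    guidelines: an object is labeled only if its box height or
    width is at least min_side pixels (10px per the guidelines).

    box: (x1, y1, x2, y2) in pixel coordinates.
    """
    x1, y1, x2, y2 = box
    return (x2 - x1) >= min_side or (y2 - y1) >= min_side
```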