Dec 12, 2020 · 9 min read ☕ (Last updated: Jun 30, 2024)

About Me

Profile

  • Built and served machine learning products at many startups across various domains: Audio & Speech, Vision, NLP, Recommendation Systems, Tabular, and LLM applications.

  • Kaggle 2x Expert. Highest competition rank: top 0.1%.

  • Alternative Military Service Status : discharged (2020/11/27 ~ 2023/09/26)

  • CV : [PDF] (as of Jun. 2024)

Email kozistr@gmail.com
Github https://github.com/kozistr
Kaggle https://www.kaggle.com/kozistr
Linkedin https://www.linkedin.com/in/kozistr

Challenges & Awards

Machine Learning

Hacking

  • Boot2Root CTF 2018 :: 2nd place (Demon + alpha)

  • Harekaze CTF 2017 :: 3rd place (SeoulWesterns)

  • WhiteHat League 1 (2017) :: 2nd place (Demon)

    • Awarded by the Korea Information Technology Research Institute (KITRI). Received an award of $3,000.

Work Experience

  • 2023 - 2024 : Joined Sionic AI. Built enterprise-grade LLM applications.
  • 2021 - 2023 : Joined Viva Republica (Toss). Developed many products like BNPL, CSS, OCR, NPS, CDP, and in-house products.
  • 2020 - 2021 : Joined Watcha. Developed the Watcha recommendation system and contributed to other products such as Watcha Pedia and in-house applications.
  • 2019 - 2020 : Joined Rainist (Banksalad). Developed a transaction classifier service that categorizes transactions in real time with low latency and high accuracy.
  • 2019 : Joined VoyagerX. Developed a speaker diarization product that automatically recognizes the contents of the meeting.
  • ~ 2019 : Offensive security. Mainly researched and studied Linux kernel exploitation and reverse engineering.

Company

Machine Learning Engineer, Sionic AI, (2023.10.23 ~ 2024.06.30)

  • Built enterprise-grade LLM applications.
    • Developed and managed an advanced RAG pipeline and server, text embedding and reranker models, a vector database, and inference servers.
    • Developed utility services such as category classification, keyword extraction, and summarization.
  • Worked full-time (early start-up member).

Data Scientist, Toss core, (2021.12.06 ~ 2023.09.27)

  • Developed the TPS (Toss Profile Service) product.
  • Various models to boost Loan Comparison products.
    • Developed a CSS model using only non-financial data. It outperformed the previous method by about 4%p on the primary metric.
    • Developed models to predict loan approval and interest rate.
  • Various CSS models for the CB (Credit Bureau).
    • Developed a more accurate & robust CSS model that mainly targets thin-filers. It outperformed the previous method by about 15%.
    • Developed a model that predicts consumer proposal status.
    • Developed a transaction classifier with finance-relevant categories, used in feature engineering to boost the performance of the CSS model.
  • Classified the categories of user reviews for the NPS (Net Promoter Score) product.
    • Developed a RESTful API server that runs deep learning model inference for the batch job.
    • Saved analysis time and labor of the NPS team.
  • OCR model to break captchas for the automation product.
    • Developed lightweight models (text detector & captcha classifier) for real-time inference (about 1000 TPS for a batch transaction, 80 ~ 100 TPS for a sample on the CPU) and built a RESTful API server to serve the models in real time on the CPU.
    • A/B test results, Google Vision OCR vs. the new captcha model:
      • Accuracy (top1) : improved 50%p (45% to 95%)
      • Latency (p95) : reduced by about 80x (about 1000ms to 12ms)
      • Cost : reduced by about $7,000+ / year
      • It also contributed to shortening the funnel and increasing user conversion.
  • User consumption forecasting model for the *CDP Product.
    • Developed a Transformer-based sequential model that predicts what users will consume in the next month.
    • Built an efficient pipeline to process and train lots of tabular data.
  • CSS model for BNPL (Buy Now Pay Later) service.
    • Developed the CSS model (default prediction), mainly targeting thin-filers. The new model achieved the targeted default rate of about 1%.
    • Developed the explainer to describe which factors affect the rejection.
  • Transaction category classification model to boost the advertisement.
    • Developed an ads category classifier that indirectly increases revenue.
  • Internal product: Slack bots that summarize long threads.
    • Help people understand the context quickly with minimum effort.
    • Summarize the weekly mail using ChatGPT with prompt engineering.
  • Worked full-time.

% *CDP: Customer Data Platform. Lots of user segments generated by the ML models.

Machine Learning Researcher, Watcha, (2020.06.22 ~ 2021.12.03)

  • Watcha recommendation system to offer a better user experience and increase paid conversion.

    • Developed an advanced training recipe & architecture to improve training stability and performance. Also worked on post-processing to recommend unseen content to users. In the A/B test, the new model boosted the Click Ratio by about 1.01%+.
    • Developed a network that captures users' active time, with augmentations that bring training stability and a performance gain. In the A/B/C test, the new model beat Div2Vec on the online metrics while achieving performance comparable to the previous model (A: Div2Vec, B: the previous model, C: the new model).
      • *Viewing Days (mean): improved 1.012%+
      • *Viewing Minutes (median): improved 1.015%+
    • Developed a sequential recommendation architecture to recommend what content to watch next. It achieved SOTA performance, outperforming previous SOTA architectures such as BERT4Rec. In the A/B test, the new model improved the following metrics (A: previous algorithm, B: the new model).
      • Paid Conversion : improved 1.39%p+
      • *Viewing Days (mean): improved 0.25%p+
      • *Viewing Minutes (median): improved 4.10%p+
      • Click Ratio : improved 4.30%p+
      • Play Ratio : improved 2.32%p+
  • Face recognition architecture to find actors in poster & still-cut images for the Watcha Pedia product.
    • Developed a pipeline to detect & recognize actor faces in images with face detection & identification deep learning models (similarity-based searching).
    • Built a daily job that runs on the CPU. Also optimized CPU-intensive operations to run fast.
  • Internal product to predict users' expected view-time of the content.
    • Before the content is imported, the model gives insight into the valuation of the content, such as the expected view-time, which affects the cost of the content.
  • Internal product to support designers' work.
    • Developed an image super-resolution model to upscale images more accurately and faster (e.g., waifu).
  • Music recommendation system for Watcha Music (prototype).
  • Worked full-time.

% *Viewing Days : how many days users are active on an app each month.

% *Viewing Minutes : how many minutes the user watched the content.

Machine Learning Engineer, Rainist, (2019.11.11 ~ 2020.06.19)

  • Transaction category classification application to identify the category for the convenience of user experience.
    • Developed the lightweight transaction category classification model. In the A/B test, the new model achieved 25 ~ 30%p+ *Accuracy improvement.
    • Developed the backends (e.g., model serving, business logic microservices) in Python.
      • Used an inference-oriented framework (ONNX) to achieve stable, low latency.
      • Achieved a target latency of about 7 ~ 10 TPS (p50) while handling 1M transactions / day (1 transaction = 100 samples).
  • CSS model to forecast the possibility of loan overdue.
  • Worked full-time.

% *Accuracy : how many users don't update their transactions' category.

Machine Learning Engineer, VoyagerX, (2019.01.07 ~ 2019.10.04)

  • Meeting-minutes deep learning application that automatically recognizes speakers & speech (speaker diarization).
    • Developed the backend to diarize the conversation.
    • Developed the lightweight speaker verification model (served at AWS Lambda).
    • Developed the on/offline speaker diarization based on clustering & E2E methods.
  • Hair Salon project to naturally swap the user's hair for the hairstyle they want.
    • Developed a hair/face image segmentation model to identify hair & face accurately.
    • Developed an image in-painting model to remove the hair.
    • Developed an I2I translation model to change the hairstyle.
  • Worked as an intern.

Penetration Tester, ELCID, (2016.07 ~ 2016.08)

  • Penetration-tested network firewall and anti-virus products.
  • Worked part-time.

Outsourcing

  • Developed a Korean university course information web parser (about 40 universities). Twice (2017.07, 2018.03)
  • Developed AWS CloudTrail logger analyzer. (2019.09 ~ 2019.10)

Lab

HPC Lab, KoreaTech, Undergraduate Researcher, (2018.09 ~ 2018.12)

  • Wrote a paper about an improved TextCNN model to predict movie ratings.

Publications

Paper

[1] Kim et al., CNN Architecture Predicting Movie Rating, 2020.01.

  • Wrote about a CNN architecture that applies a channel-attention method (SE Module) to the TextCNN model, bringing a performance gain on the task while generally keeping its latency.
  • Handles un-normalized text with various convolution kernel sizes and spatial dropout.
  • Selected as one of the highlight papers for the first half of 2020

Conferences/Workshops

[1] kozistr_team, presentation at NAVER NLP Challenge 2018, SRL Task

  • Took on the SRL task without any domain knowledge. Presented the trial and error during the competition.

Journals

[1] zer0day, Windows Anti-Debugging Techniques (CodeEngn 2016) Sep. 2016. PDF

  • Wrote about many anti-reversing / anti-debugging techniques (A to Z) available for Windows executable binaries.

Posts

[1] kozistr (as part of team Dragonsong), Towards Data Science

  • Wrote about an audio classifier with deep learning, based on the Kaggle challenge in which we participated.

Personal Projects

Machine/Deep Learning

Generative Models

  • GANs-tensorflow :: Lots of GAN :: Generative Adversarial Networks

    • ACGAN-tensorflow :: Auxiliary Classifier GAN in tensorflow :: code
    • StarGAN-tensorflow :: Unified GAN for multi-domain :: code
    • LAPGAN-tensorflow :: Laplacian Pyramid GAN in tensorflow :: code
    • BEGAN-tensorflow :: Boundary Equilibrium GAN in tensorflow :: code
    • DCGAN-tensorflow :: Deep Convolutional GAN in tensorflow :: code
    • SRGAN-tensorflow :: Super-Resolution GAN in tensorflow :: code
    • WGAN-GP-tensorflow :: Wasserstein GAN w/ gradient penalty in tensorflow :: code
    • ... lots of GANs (over 20) :)

Super Resolution

  • Single Image Super Resolution :: Single Image Super-Resolution (SISR)

    • rcan-tensorflow :: RCAN implementation in tensorflow :: code
    • ESRGAN-tensorflow :: ESRGAN implementation in tensorflow :: code
    • NatSR-pytorch :: NatSR implementation in pytorch :: code

I2I Translation

  • Improved Content Disentanglement :: tuned version of 'Content Disentanglement' in pytorch :: code

Style Transfer

  • Image-Style-Transfer :: Image Neural Style Transfer

    • style-transfer-tensorflow :: Image Style-Transfer in tensorflow :: code

Text Classification/Generation

  • movie-rate-prediction :: Korean sentences classification in tensorflow :: code
  • KoSpacing-tensorflow :: Automatic Korean sentences spacing in tensorflow :: code
  • text-tagging :: Automatic Korean articles categories classification in tensorflow :: code

Speech Synthesis

  • Tacotron-tensorflow :: Text To Speech (TTS)

    • tacotron-tensorflow :: lots of TTS models in tensorflow :: code

Optimizer

  • pytorch-optimizer :: optimizer & lr scheduler collections in PyTorch

    • pytorch_optimizer :: pytorch-optimizer is a collection of optimizers & lr schedulers in PyTorch. I re-implemented the algorithms based on the original papers (with speed & memory tweaks and plug-ins). It also includes useful and practical optimization ideas. :: code
  • AdaBound :: Optimizer that trains as fast as Adam and as good as SGD

    • AdaBound-tensorflow :: AdaBound Optimizer implementation in tensorflow :: code
  • RAdam :: On The Variance Of The Adaptive Learning Rate And Beyond in tensorflow

    • RAdam-tensorflow :: RAdam Optimizer implementation in tensorflow :: code

R.L

  • Rosetta Stone :: Hearthstone simulator using C++ with some reinforcement learning :: code

Open Source Contributions

  • syzkaller :: New Generation of Linux Kernel Fuzzer :: #575
  • simpletransformers :: Transformers made simple w/ training, evaluating, and prediction possible w/ one line each. :: #290
  • pytorch-image-models :: PyTorch image models, scripts, pretrained weights :: #1058, #1069
  • deit :: DeiT: Data-efficient Image Transformers :: #140, #147, #148
  • MADGRAD :: MADGRAD Optimization Method :: #11
  • tensorflow-image-models :: TensorFlow Image Models (tfimm) is a collection of image models with pretrained weights, obtained by porting architectures from timm to TensorFlow :: #61
  • onnx2tf :: Self-Created Tools to convert ONNX files (NCHW) to TensorFlow/TFLite/Keras format (NHWC). The purpose of this tool is to solve the massive Transpose extrapolation problem in onnx-tensorflow (onnx-tf) :: #259
  • dadaptation :: D-Adaptation for SGD, Adam and AdaGrad :: #21
  • python-mastery :: Advanced Python Mastery :: #14
  • text-embeddings-inference :: A blazing fast inference solution for text embeddings models :: #62, #285
  • langchain-ai :: Build context-aware reasoning applications :: #18839, #20057
  • qdrant :: Qdrant - High-performance, massive-scale Vector Database for the next generation of AI :: #3982
  • bfb :: high-load benchmarking tool :: #37
  • qdrant-web-ui :: Self-hosted web UI for Qdrant :: #191

Plug-Ins

IDA-pro plug-in - Golang ELF binary (x86, x86-64), RTTI parser

  • Recovers stripped symbols & information and patches byte-code so the binary can be decompiled with Hex-Rays

Security, Hacking

CTFs, Conferences

  • POC 2016 Conference Staff
  • HackingCamp 15 CTF Staff, Challenge Maker
  • CodeGate 2017 OpenCTF Staff, Challenge Maker
  • HackingCamp 16 CTF Staff, Challenge Maker
  • POX 2017 CTF Staff, Challenge Maker
  • KID 2017 CTF Staff, Challenge Maker
  • Belluminar 2017 CTF Staff
  • HackingCamp 17 CTF Staff, Challenge Maker
  • HackingCamp 18 CTF Staff, Challenge Maker

Teams

Hacking Team, Fl4y. Since 2017.07 ~

Hacking Team, Demon by POC. 2014.02 ~ 2018.08


Education

BS in Computer Engineering from KUT

Presentations

2018

[2] Artificial Intelligence ZeroToAll, Apr 2018.

[1] Machine Learning ZeroToAll, Mar 2018.

2015

[1] Polymorphic Virus VS AV Detection, Oct 2015.

2014

[1] Network Sniffing & Detection, Oct 2014.