
12 HuggingFace

Introduction

  • HuggingFace is the GitHub for LLMs.
  • A platform that provides access to the following open-source resources.
    • Models
    • Datasets
    • Spaces (where apps are hosted; includes leaderboards)
    • Libraries
  • Visit HuggingFace at https://huggingface.co.

HuggingFace Libraries

  • Hub - Allows users to download and upload models and datasets (see the sketch after this list).
  • Datasets - Gives access to the datasets hosted on the Hub.
  • Transformers - Wrapper around the LLMs; it runs the underlying deep neural networks under our control.
  • PEFT - Parameter-Efficient Fine Tuning
  • TRL - Transformer Reinforcement Learning
    • RM - Reward Modelling
    • PPO - Proximal Policy Optimization
    • SFT - Supervised Fine Tuning
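  • A minimal sketch of the Hub and Datasets libraries in action; the token is a placeholder and 'imdb' is simply a well-known public dataset chosen for illustration.

        from huggingface_hub import login
        from datasets import load_dataset

        # Authenticate with the Hub (required for gated or private models and datasets)
        login(token='hf_...')  # or set the HF_TOKEN environment variable

        # Download a public dataset from the Hub
        dataset = load_dataset('imdb', split='train')
        print(dataset[0]['text'][:200], dataset[0]['label'])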

HuggingFace APIs

  • There are two levels of APIs in HuggingFace.
    1. Pipelines - High-level APIs to perform standard tasks quickly.
    2. Tokenizers and Models - Lower-level APIs that provide the most power and control (a short sketch follows).
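  • A minimal sketch of the lower-level Tokenizers and Models API, using a sentiment-analysis model as an example (the model name is only illustrative):

        import torch
        from transformers import AutoTokenizer, AutoModelForSequenceClassification

        model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(model_name)

        # Tokenize the input, run the model manually, and map the logits back to a label
        inputs = tokenizer("I'm super excited to be on the way to learn LLM!", return_tensors='pt')
        with torch.no_grad():
            logits = model(**inputs).logits
        print(model.config.id2label[logits.argmax(dim=-1).item()])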

Pipelines

  • Pipelines are good for the following types of tasks.
    1. Sentiment Analysis
    2. Classification
    3. Named Entity Recognition
    4. Question Answering
    5. Summarizing
    6. Translation
    7. Text generation
    8. Image generation
    9. Audio generation
  • Install the following packages.

        pip install -q transformers datasets diffusers soundfile
    

  • Perform the tasks using the code below.

    # Imports
    import torch
    from transformers import pipeline
    from diffusers import DiffusionPipeline
    from datasets import load_dataset
    import soundfile as sf
    from IPython.display import Audio
    
    # Sentiment Analysis
    classifier = pipeline("sentiment-analysis")
    result = classifier("I'm super excited to be on the way to learn LLM!")
    print(result)
    
    # Named Entity Recognition
    ner = pipeline('ner', grouped_entities=True)
    result = ner('Barack Obama was the 44th president of the United States.')
    print(result)
    
    # Question Answering with Context
    question_answerer = pipeline('question-answering')
    result = question_answerer(question='What is the name of the president?', context='Barack Obama was the 44th president of the United States.')
    print(result)
    
    # Text Summarization
    summarizer = pipeline('summarization')
    text = '''HuggingFace is an incredibly versatile and powerful tool for natural language processing.
    It allows users to perform a wide variety of tasks. It is extremely popular and used in the open source community.'''
    summary = summarizer(text, max_length=30, min_length=10, do_sample=False)
    print(summary[0]['summary_text'])
    
    # Translation
    translator = pipeline('translation_en_to_fr')
    result = translator('HuggingFace is an incredibly versatile and powerful tool for natural language processing.')
    print(result[0]['translation_text'])
    
    # Classification
    classifier = pipeline('zero-shot-classification')
    result = classifier('This is a course about the Transformers library', candidate_labels=['education', 'politics', 'business'])
    print(result)
    
    
    # Text Generation
    generator = pipeline('text-generation')
    result = generator('In this course, we will teach you how to')
    print(result[0]['generated_text'])
    
    # Image Generation
    image_gen = DiffusionPipeline.from_pretrained(
        'stabilityai/stable-diffusion-2-1',
        torch_dtype=torch.float16,
        variant='fp16',
        use_safetensors=True
    ).to('cuda')  # float16 weights need a GPU; drop torch_dtype and variant to run on CPU
    
    text = 'A class of data scientists learning about AI, in the surreal style of Salvador Dali'
    image = image_gen(text).images[0]
    image  # displays inline in a Jupyter notebook
    
    # Audio Generation
    synthesizer = pipeline('text-to-speech', model='microsoft/speecht5_tts')
    embeddings_dataset = load_dataset('Matthijs/cmu-arctic-xvectors', split='validation')
    speaker_embedding = torch.tensor(embeddings_dataset[7306]['xvector']).unsqueeze(0)
    speech = synthesizer(text, forward_params={'speaker_embeddings': speaker_embedding})
    sf.write('audio.wav', speech['audio'], samplerate=speech['sampling_rate'])
    Audio('audio.wav')
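
  • Pipelines download a default model per task; in practice it is often better to pin a specific model and, where available, a GPU. A small sketch, where the model name is just an example:

        import torch
        from transformers import pipeline

        # device=0 selects the first GPU; -1 falls back to CPU
        device = 0 if torch.cuda.is_available() else -1
        classifier = pipeline(
            'sentiment-analysis',
            model='distilbert-base-uncased-finetuned-sst-2-english',
            device=device
        )
        print(classifier('Pinning the model makes results reproducible across runs.'))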